Summary. The varying-coefficient model is an important nonparametric statistical model that allows us to examine how the effects of covariates vary with exposure variables. When the number of covariates is large, the issue of variable selection arises. In this paper, we propose and investigate marginal nonparametric screening methods to screen variables in ultra-high dimensional sparse varying-coefficient models. The proposed nonparametric independence screening (NIS) selects variables by ranking a measure of the nonparametric marginal contributions of each covariate given the exposure variable. The sure independence screening property is established under some mild technical conditions when the dimensionality is of nonpolynomial order, and the dimensionality reduction of NIS is quantified. To enhance practical utility and the finite sample performance, two data-driven iterative NIS methods are proposed for selecting thresholding parameters and variables: conditional permutation and greedy methods, resulting in Conditional-INIS and Greedy-INIS. The effectiveness and flexibility of the proposed methods are further illustrated by simulation studies and real data applications.

Keywords: Sure independence screening; Variable selection; Sparsity; Conditional permutation; False positive rates

1. Introduction

The development of information and technology drives big data collections in many areas of advanced scientific research ranging from genomic and health science to machine learning and economics. The collected data frequently have an ultra-high dimensionality $p$ that is allowed to diverge at a nonpolynomial (NP) rate with the sample size $n$, namely $\log(p) = O(n^{\rho})$ for some $\rho > 0$. For example, in biomedical research such as genomewide association studies for some mental diseases, millions of SNPs are potential covariates. Traditional statistical methods face significant challenges in dealing with such high-dimensional problems with large sample sizes.

With the sparsity assumption, variable selection helps improve the accuracy of estimation and gain scientific insights. Many significant variable selection techniques have been developed, such as Bridge regression in Frank and Friedman (1993), Lasso in Tibshirani (1996), SCAD and folded concave penalty in Fan and Li (2001), the Elastic net in Zou and Hastie (2005), Adaptive Lasso (Zou, 2006), and the Dantzig selector in Candes and Tao (2007). Methods on the implementation of folded concave penalized least-squares include the local linear approximation algorithm in Zou and Li (2008) and the plus algorithm in Zhang. However, due to the simultaneous challenges of computational expediency, statistical accuracy and algorithmic stability, these methods do not perform well in ultra-high dimensional problems.

To tackle these problems, Fan and Lv (2008) introduced a sure independence screening (SIS) method to select important variables in ultra-high dimensional linear regression models via marginal correlation learning. Hall and Miller (2009) extended the method to the generalized correlation ranking, which was further extended by Fan, Feng and Song (2011) for ultra-high dimensional nonparametric additive models, resulting in nonparametric independence screening (NIS). On a different front, Fan and Song (2010) extended the SIS idea to ultra-high dimensional generalized linear models and devised a useful technical tool for establishing the sure screening results and bounding false selection rates. Other related methods include the data-tilting method (Hall, Titterington and Xue, 2009), the marginal partial likelihood method MPLE (Zhao and Li, 2010), and robust screening methods by rank correlation (Li et al., 2012) and distance correlation (Li, Zhong and Zhu, 2012). Inspired by this previous work, our study focuses on variable screening in nonparametric varying-coefficient models with NP dimensionality.

It is well known that nonparametric models are flexible enough to reduce modeling biases. However, they suffer from the so-called "curse of dimensionality". A remarkably simple and powerful nonparametric model for dimensionality reduction is the varying-coefficient model,

\[ Y = \beta^T(W)\mathbf{X} + \varepsilon, \tag{1} \]

where $\mathbf{X} = (X_1, \ldots, X_p)^T$ is the vector of covariates, $W$ is an observable exposure variable, $Y$ is the response, and $\varepsilon$ is the random noise with conditional mean 0 and finite conditional variance. An intercept term (i.e., $X_0 = 1$) can be introduced if necessary. This model assumes that the variables in the covariate vector $\mathbf{X}$ enter the model linearly, while allowing the regression coefficient functions to vary smoothly with the exposure variable. The model retains general nonparametric characteristics and allows nonlinear interactions between the exposure variable $W$ and the covariates. It arises frequently in economics, finance, politics, epidemiology, medical science and ecology, among others. For an overview, see Fan and Zhang (2008).



When the dimensionality $p$ is finite, Fan, Zhang and Zhang (2001) proposed the generalized likelihood ratio (GLR) test to select variables in the varying-coefficient model (1). For the time-varying coefficient model, a special case of (1) with the exposure variable $W$ being the time $t$, Wang, Li and Huang (2008) applied basis function approximations and the SCAD penalty to address the problem of variable selection. In the NP dimensional setting, Lian (2011) utilized the adaptive group Lasso penalty in time-varying coefficient models. These methods still face the aforementioned three challenges.

In this paper, we consider a nonparametric screening by ranking a measure of the marginal nonparametric contributions of each covariate given the exposure variable. For each given covariate, we fit marginal regressions of the response $Y$ against the covariate $X_j$ ($j = 1, \ldots, p$) conditioning on $W$:

\[ \min_{a_j, b_j} E\bigl[(Y - a_j - b_j X_j)^2 \mid W\bigr]. \tag{2} \]

Let $a_j(W)$ and $b_j(W)$ be the solution to (2) and $\hat a_{nj}(W)$ and $\hat b_{nj}(W)$ be their nonparametric estimates. Then, we rank the importance of each covariate in the joint model according to a measure of marginal utility (which is equivalent to the goodness of fit) in its marginal model. Under some reasonable conditions, the magnitude of these marginal contributions provides useful probes of the importance of variables in the joint varying-coefficient model. This is an important extension of SIS (Fan and Lv, 2008) to a more flexible class of varying-coefficient models.

The sure screening property of NIS can be established under certain technical conditions. In some very specific cases, NIS can even be model selection consistent. In establishing results of this kind, three factors are related to the minimum distinguishable marginal signals: the stochastic error in estimating the nonparametric components, the approximation error in modeling nonparametric components, and the tail distributions of the covariates. Following Fan and Lv (2008) and Fan, Feng and Song (2011), we propose two nonparametric independence screening approaches in an iterative framework. One is called Greedy-INIS, in which we adopt a greedy method in the variable screening step. The other is called Conditional-INIS, which is built on conditional random permutation to determine a data-driven screening threshold. They both serve to effectively control the false positive rate and false negative rate with enhanced performance.

This article is organized as follows. In Section 2, we fit each marginal nonparametric regression model via B-spline basis approximation and screen variables by ranking a measure of these estimators. In Section 3, we establish the sure screening property and model selection consistency under certain technical conditions. Iterative NIS procedures (namely Greedy-INIS and Conditional-INIS) are developed in Section 4. In Section 5, a set of numerical studies is conducted to evaluate the performance of our proposed methods.

2. Models and Nonparametric Marginal Screening Method

In this section we study the varying-coefficient model with the conditional linear structure as in (1). Assume that the functional coefficient vector $\beta(\cdot) = (\beta_1(\cdot), \ldots, \beta_p(\cdot))^T$ is sparse. Let $\mathcal{M}_* = \{j : E[\beta_j^2(W)] > 0\}$ be the true sparse model with nonsparsity size $s_n = |\mathcal{M}_*|$. We allow $p$ to grow with $n$ and denote it by $p_n$ whenever necessary.



2.1. Marginal Regression

For $j = 1, \ldots, p$, let $a_j(W)$ and $b_j(W)$ be the minimizer of the following marginal regression problem:

\[ \min_{a_j(W),\, b_j(W) \in L_2(P)} E\bigl[(Y - a_j(W) - b_j(W)X_j)^2 \mid W\bigr], \tag{3} \]

where $P$ denotes the joint distribution of $(Y, W, \mathbf{X})$ and $L_2(P)$ is the class of square integrable functions under the measure $P$. By some algebra, we have that the minimizer of (3) is

\[ b_j(W) = \frac{\mathrm{Cov}[X_j, Y \mid W]}{\mathrm{Var}[X_j \mid W]}, \qquad a_j(W) = E[Y \mid W] - b_j(W)E[X_j \mid W]. \tag{4} \]

Let $a_0(W) = E[Y \mid W]$. We rank the marginal utility of covariates by

\[ u_j = \|a_j(W) + b_j(W)X_j\|^2 - \|a_0(W)\|^2, \tag{5} \]

where $\|f\|^2 = Ef^2$. It can be seen that

\[ u_j = E\bigl[b_j^2(W)(X_j - E[X_j \mid W])^2\bigr] = E\left[\frac{(\mathrm{Cov}[X_j, Y \mid W])^2}{\mathrm{Var}[X_j \mid W]}\right]. \tag{6} \]

For each $j = 1, \ldots, p$, if $\mathrm{Var}[X_j \mid W] \equiv 1$, then $u_j$ has the same quantity as the measure of the marginal functional coefficient $\|b_j(W)\|^2$. On the other hand, this marginal utility is closely related to the conditional correlation between the $X_j$'s and $Y$, as $u_j = 0$ if and only if $\mathrm{Cov}[X_j, Y \mid W] = 0$ almost surely.
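As a worked check of (6), the short derivation below expands (5) using the closed forms in (4). It is a sketch that uses only the quantities already defined in this subsection, with $\|f\|^2 = Ef^2$; the only extra fact used is that $E[X_j - E[X_j \mid W] \mid W] = 0$, which removes the cross term.

```latex
\begin{align*}
a_j(W) + b_j(W)X_j
  &= a_0(W) + b_j(W)\bigl(X_j - E[X_j \mid W]\bigr)
  && \text{by (4) with } a_0(W) = E[Y \mid W], \\
\|a_j(W) + b_j(W)X_j\|^2
  &= \|a_0(W)\|^2 + E\bigl[b_j^2(W)(X_j - E[X_j \mid W])^2\bigr]
  && \text{cross term has conditional mean } 0, \\
u_j
  &= E\bigl[b_j^2(W)\,\mathrm{Var}[X_j \mid W]\bigr]
   = E\!\left[\frac{(\mathrm{Cov}[X_j, Y \mid W])^2}{\mathrm{Var}[X_j \mid W]}\right]
  && \text{by (5) and (4)}.
\end{align*}
```

In particular, $u_j \ge 0$ for every $j$, which is why the utility can be read as the gain in explained variation from adding $X_j$ to the intercept-only model.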



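2.2. Marginal Regression Estimation with B-spline

To obtain an estimate of the marginal utility $u_j$, $j = 1, \ldots, p$, we approximate $a_j(W)$ and $b_j(W)$ by functions in $\mathcal{S}_n$, the space of polynomial splines of degree $l \ge 1$ on $\mathcal{W}$, a compact set. Let $\{B_k, k = 1, \ldots, L_n\}$ denote its normalized B-spline basis with $\|B_k\|_\infty \le 1$, where $\|\cdot\|_\infty$ is the sup norm. Then

\[ a_j(W) \approx \sum_{k=1}^{L_n} \eta_{jk}B_k(W), \qquad b_j(W) \approx \sum_{k=1}^{L_n} \theta_{jk}B_k(W), \]

where $\{\eta_{jk}\}_{k=1}^{L_n}$ and $\{\theta_{jk}\}_{k=1}^{L_n}$ are scalar coefficients.

We now consider the following sample version of the marginal regression problem:

\[ \min_{\eta_j,\, \theta_j \in \mathbb{R}^{L_n}} \frac{1}{n}\sum_{i=1}^{n} \bigl(Y_i - B(W_i)\eta_j - B(W_i)\theta_j X_{ji}\bigr)^2, \tag{7} \]

where $\eta_j = (\eta_{j1}, \ldots, \eta_{jL_n})^T$, $\theta_j = (\theta_{j1}, \ldots, \theta_{jL_n})^T$ and $B(\cdot) = (B_1(\cdot), \ldots, B_{L_n}(\cdot))$. It is easy to show that the minimizer of (7) is given by

\[ (\hat\eta_j^T, \hat\theta_j^T)^T = (Q_{nj}^T Q_{nj})^{-1} Q_{nj}^T \mathbf{Y}, \tag{8} \]

where

\[ Q_{nj} = \begin{pmatrix} B(W_1) & X_{j1}B(W_1) \\ \vdots & \vdots \\ B(W_n) & X_{jn}B(W_n) \end{pmatrix} \]

is an $n \times 2L_n$ matrix whose left $n \times L_n$ block is denoted $\mathbf{B}_n$. As a result, the estimates of $a_j$ and $b_j$, $j = 1, \ldots, p$, are given by

\[ \hat a_{nj}(W) = B(W)\hat\eta_j = (B(W), 0_{L_n}^T)(Q_{nj}^T Q_{nj})^{-1} Q_{nj}^T \mathbf{Y}, \qquad \hat b_{nj}(W) = B(W)\hat\theta_j = (0_{L_n}^T, B(W))(Q_{nj}^T Q_{nj})^{-1} Q_{nj}^T \mathbf{Y}, \tag{9} \]

where $0_{L_n}$ is an $L_n$-dimensional vector with all entries 0. Similarly, we have the estimate of the intercept function $a_0$ by

\[ \hat a_{n0}(W) = B(W)\hat\eta_0 = B(W)(\mathbf{B}_n^T \mathbf{B}_n)^{-1}\mathbf{B}_n^T \mathbf{Y}, \tag{10} \]

where

\[ \hat\eta_0 = \arg\min_{\eta_0 \in \mathbb{R}^{L_n}} \frac{1}{n}\sum_{i=1}^{n} \bigl(Y_i - B(W_i)\eta_0\bigr)^2. \tag{11} \]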

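We now define an estimate of the marginal utility $u_j$ as

\[ \hat u_{nj} = \|\hat a_{nj}(\mathbf{W}) + \hat b_{nj}(\mathbf{W})\mathbf{X}_j\|_n^2 - \|\hat a_{n0}(\mathbf{W})\|_n^2, \tag{12} \]

where $\mathbf{W} = (W_1, \ldots, W_n)^T$ and $\|\cdot\|_n^2$ denotes the empirical norm $n^{-1}\|\cdot\|^2$. Note that throughout this paper, whenever two vectors $\mathbf{a}$ and $\mathbf{b}$ are of the same length, $\mathbf{ab}$ denotes the componentwise product. Given a predefined threshold value $\tau_n$, we select a set of variables as follows:

\[ \widehat{\mathcal{M}}_{\tau_n} = \{1 \le j \le p : \hat u_{nj} \ge \tau_n\}. \tag{13} \]

Alternatively, we can rank the covariates by the residual sum of squares of marginal nonparametric regressions, which is defined as

\[ \hat v_{nj} = \frac{1}{n}\|\mathbf{Y} - \hat a_{nj}(\mathbf{W}) - \hat b_{nj}(\mathbf{W})\mathbf{X}_j\|^2, \tag{14} \]

and we select variables as follows:

\[ \widehat{\mathcal{M}}_{\nu_n} = \{1 \le j \le p : \hat v_{nj} \le \nu_n\}, \tag{15} \]

where $\nu_n$ is a predefined threshold value.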

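It is worth noting that ranking by the marginal utility $\hat u_{nj}$ is equivalent to ranking by the measure of goodness of fit $\hat v_{nj}$. To see the equivalence, first note that

\[ \|\hat a_{nj}(\mathbf{W}) + \hat b_{nj}(\mathbf{W})\mathbf{X}_j\|_n^2 = \frac{1}{n}\mathbf{Y}^T Q_{nj}(Q_{nj}^T Q_{nj})^{-1} Q_{nj}^T \mathbf{Y}, \tag{16} \]

and

\[ \frac{1}{n}\sum_{i=1}^{n} Y_i\bigl(\hat a_{nj}(W_i) + \hat b_{nj}(W_i)X_{ji}\bigr) = \frac{1}{n}\mathbf{Y}^T Q_{nj}(Q_{nj}^T Q_{nj})^{-1} Q_{nj}^T \mathbf{Y}. \tag{17} \]

It follows from (16) and (17) that

\[ \hat v_{nj} = \|\mathbf{Y}\|_n^2 - \|\hat a_{n0}(\mathbf{W})\|_n^2 - \hat u_{nj}. \tag{18} \]

Since the first two terms on the right-hand side of (18) do not vary in $j$, ranking by $\hat u_{nj}$ is the same as ranking by $\hat v_{nj}$. Therefore, selecting variables with large marginal utility is the same as picking those that yield small marginal residual sum of squares.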
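The equivalence between ranking by marginal utility and ranking by marginal residual sum of squares can be checked numerically. The sketch below is not the B-spline estimator of this section: to stay dependency-free it swaps in a piecewise-constant basis (indicator functions of equal-width bins of $W \in [0, 1]$), for which the least-squares problem separates into one simple linear regression per bin. The function names, the bin count and the simulated model are all illustrative assumptions.

```python
import random


def bin_index(w, n_bins):
    """Map w in [0, 1] to one of n_bins equal-width bins."""
    return min(int(w * n_bins), n_bins - 1)


def marginal_fit(y, w, x, n_bins):
    """Least-squares fit of y on a(W) + b(W) * x, with a and b constant on each bin.

    Assumption: a piecewise-constant basis replaces the B-spline basis, so the
    fit reduces to an ordinary simple linear regression inside every bin.
    """
    n = len(y)
    fitted = [0.0] * n
    for k in range(n_bins):
        idx = [i for i in range(n) if bin_index(w[i], n_bins) == k]
        if not idx:
            continue
        mean_y = sum(y[i] for i in idx) / len(idx)
        mean_x = sum(x[i] for i in idx) / len(idx)
        sxx = sum((x[i] - mean_x) ** 2 for i in idx)
        sxy = sum((x[i] - mean_x) * (y[i] - mean_y) for i in idx)
        slope = sxy / sxx if sxx > 0 else 0.0
        for i in idx:
            fitted[i] = mean_y + slope * (x[i] - mean_x)
    return fitted


def intercept_fit(y, w, n_bins):
    """Least-squares fit of y on a_0(W) alone, i.e. the bin means of y."""
    return marginal_fit(y, w, [0.0] * len(y), n_bins)


def mean_square(v):
    """Empirical squared norm: average of squared entries."""
    return sum(t * t for t in v) / len(v)


def utility_and_rss(y, w, x, n_bins=5):
    """Return the sample marginal utility and the marginal residual mean square."""
    fit = marginal_fit(y, w, x, n_bins)
    fit0 = intercept_fit(y, w, n_bins)
    utility = mean_square(fit) - mean_square(fit0)
    rss = mean_square([yi - fi for yi, fi in zip(y, fit)])
    return utility, rss


def simulate(n=400, seed=0):
    """Hypothetical toy model: Y = 2 W X1 + noise, with X2 an irrelevant covariate."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(n)]
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [rng.gauss(0, 1) for _ in range(n)]
    y = [2 * wi * a + rng.gauss(0, 0.5) for wi, a in zip(w, x1)]
    return y, w, x1, x2


if __name__ == "__main__":
    y, w, x1, x2 = simulate()
    for name, x in (("X1", x1), ("X2", x2)):
        u, v = utility_and_rss(y, w, x)
        print(name, "utility:", round(u, 4), "residual mean square:", round(v, 4))
```

Because the fitted values are an orthogonal projection of the response, the residual mean square equals the mean square of the response minus that of the intercept fit minus the utility, so the covariate with the larger utility always has the smaller residual mean square.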

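To bridge $u_j$ and $\hat u_{nj}$, we define the population version of the marginal regression using the B-spline basis. From now on, we will omit the argument in $B(W)$ and write $B$ whenever the context is clear. Let $\tilde a_j(W) = B\tilde\eta_j$ and $\tilde b_j(W) = B\tilde\theta_j$, where $\tilde\eta_j$ and $\tilde\theta_j$ are the minimizer of

\[ \min_{\eta_j,\, \theta_j \in \mathbb{R}^{L_n}} E\bigl[(Y - B\eta_j - B\theta_j X_j)^2\bigr], \tag{19} \]

and $\tilde a_0(W) = B\tilde\eta_0$, where $\tilde\eta_0$ is the minimizer of

\[ \min_{\eta_0 \in \mathbb{R}^{L_n}} E\bigl[(Y - B\eta_0)^2\bigr]. \tag{20} \]

It can be seen that

\[ (\tilde a_j(W), \tilde b_j(W))^T = \mathrm{diag}(B, B)\bigl(E[Q_j^T Q_j]\bigr)^{-1} E[Q_j^T Y], \tag{21} \]
\[ \tilde a_0(W) = B\bigl(E[B^T B]\bigr)^{-1} E[B^T Y], \tag{22} \]

where $Q_j = (B, X_j B)$, and

\[ \tilde u_j = \|\tilde a_j(W) + \tilde b_j(W)X_j\|^2 - \|\tilde a_0(W)\|^2 = E[YQ_j]\bigl(E[Q_j^T Q_j]\bigr)^{-1}E[Q_j^T Y] - E[YB]\bigl(E[B^T B]\bigr)^{-1}E[B^T Y]. \tag{23} \]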



3. Sure Screening

In this section, we establish the sure screening properties of the proposed method for model (1). Recall that by (6) the population version of the marginal utility quantifies the relationship between the $X_j$'s and $Y$ as follows:

\[ u_j = E\left[\frac{(\mathrm{Cov}[X_j, Y \mid W])^2}{\mathrm{Var}[X_j \mid W]}\right], \qquad j = 1, \ldots, p. \tag{24} \]

Then the following two conditions guarantee that the marginal signal of the active components $\{u_j\}_{j \in \mathcal{M}_*}$ does not vanish.

(i) Suppose for $j = 1, \ldots, p$, $\mathrm{Var}[X_j \mid W]$ is uniformly bounded away from 0 and infinity on $\mathcal{W}$, where $\mathcal{W}$ is the compact support of $W$. That is, there exist some positive constants $h_1$ and $h_2$ such that $0 < h_1 \le \mathrm{Var}[X_j \mid W] \le h_2 < \infty$.

(ii) $\min_{j \in \mathcal{M}_*} E[(\mathrm{Cov}[X_j, Y \mid W])^2] \ge c_1 L_n n^{-2\kappa}$, for some $\kappa > 0$ and $c_1 > 0$.

Then under conditions (i) and (ii),

\[ \min_{j \in \mathcal{M}_*} u_j \ge c_1 L_n n^{-2\kappa}/h_2. \tag{25} \]

Note that in condition (ii), the number of basis functions $L_n$ is not intrinsic. By Remark 1 below, $L_n$ should be chosen in correspondence with the smoothness condition of the nonparametric component. Therefore, condition (ii) depends only on $\kappa$ and the smoothness parameter $d$ in condition (iv). We keep $L_n$ here to make the relationship more explicit.
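The lower bound on the marginal signal of the active components follows in two lines from the conditions. The sketch below spells out that step; it uses only the upper bound $h_2$ on the conditional variance from condition (i) and the signal condition (ii), with $\kappa$ and $c_1$ as in condition (ii).

```latex
\begin{align*}
u_j
  &= E\!\left[\frac{(\mathrm{Cov}[X_j, Y \mid W])^2}{\mathrm{Var}[X_j \mid W]}\right]
   \;\ge\; \frac{1}{h_2}\, E\bigl[(\mathrm{Cov}[X_j, Y \mid W])^2\bigr]
  && \text{since } \mathrm{Var}[X_j \mid W] \le h_2, \\
\min_{j \in \mathcal{M}_*} u_j
  &\ge \frac{1}{h_2} \min_{j \in \mathcal{M}_*} E\bigl[(\mathrm{Cov}[X_j, Y \mid W])^2\bigr]
   \;\ge\; \frac{c_1 L_n n^{-2\kappa}}{h_2}
  && \text{by condition (ii)}.
\end{align*}
```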



3.1. Sure Screening Properties

The following conditions (iii)-(vii) are required for the B-spline approximation in marginal regressions and for establishing the sure screening properties.

(iii) The density function $g$ of $W$ is bounded away from zero and infinity on $\mathcal{W}$. That is, $0 < T_1 \le g(W) \le T_2 < \infty$ for some constants $T_1$ and $T_2$.

(iv) Functions $\{a_j\}_{j=0}^{p}$ and $\{b_j\}_{j=1}^{p}$ belong to a class of functions $\mathcal{B}$, whose $r$th derivative $f^{(r)}$ exists and is Lipschitz of order $\alpha$. That is,

\[ \mathcal{B} = \bigl\{f(\cdot) : |f^{(r)}(s) - f^{(r)}(t)| \le M|s - t|^{\alpha} \text{ for } s, t \in \mathcal{W}\bigr\}, \]

for some positive constant $M$, where $r$ is a nonnegative integer and $\alpha \in (0, 1]$ such that $d = r + \alpha > 0.5$.

(v) Suppose for all $j = 1, \ldots, p$, there exist a positive constant $K_1$ and $r_1 \ge 2$ such that

\[ P(|X_j| > t \mid W) \le \exp\bigl(1 - (t/K_1)^{r_1}\bigr), \tag{26} \]

uniformly on $\mathcal{W}$, for any $t \ge 0$. Furthermore, let $m(\mathbf{X}^*) = E[Y \mid \mathbf{X}, W]$, where $\mathbf{X}^* = (\mathbf{X}^T, W)^T$. Suppose there exist some positive constants $K_2$ and $r_2$ satisfying $r_1 r_2/(r_1 + r_2) \ge 1$ such that

\[ P(|m(\mathbf{X}^*)| > t \mid W) \le \exp\bigl(1 - (t/K_2)^{r_2}\bigr), \tag{27} \]

uniformly on $\mathcal{W}$, for any $t \ge 0$.

(vi) The random errors $\{\varepsilon_i\}_{i=1}^{n}$ are i.i.d. with conditional mean 0, and there exist some positive constants $K_3$ and $r_3$ satisfying $r_1 r_3/(r_1 + r_3) \ge 1$ such that

\[ P(|\varepsilon| > t \mid W) \le \exp\bigl(1 - (t/K_3)^{r_3}\bigr), \tag{28} \]

uniformly on $\mathcal{W}$, for any $t \ge 0$.

(vii) There exists some constant $\xi \in (0, 1/h_2)$ such that $L_n^{-2d-1} \le c_1(1/h_2 - \xi)n^{-2\kappa}/M_1$.

Proposition 1. Under conditions (i)-(v), there exists a positive constant $M_1$ such that

\[ u_j - \tilde u_j \le M_1 L_n^{-2d}. \tag{29} \]

In addition, when $L_n^{-2d-1} \le c_1(1/h_2 - \xi)n^{-2\kappa}/M_1$ for some $\xi \in (0, 1/h_2)$, we have

\[ \min_{j \in \mathcal{M}_*} \tilde u_j \ge c_1 \xi L_n n^{-2\kappa}. \tag{30} \]

Remark 1. It follows from Proposition 1 that the minimum signal level of $\{\tilde u_j\}_{j \in \mathcal{M}_*}$ is approximately the same as that of $\{u_j\}_{j \in \mathcal{M}_*}$, provided that the approximation error is negligible. It also shows that the number of basis functions $L_n$ should be chosen as

\[ L_n \ge C n^{2\kappa/(2d+1)} \]

for some positive constant $C$. In other words, the smoother the underlying function is (i.e., the larger $d$ is), the smaller $L_n$ we can take.

The following Theorem 1 provides the sure screening properties of the nonparametric independence screening method proposed in Section 2.2.

Theorem 1. Suppose conditions (i)-(vi) hold.

(i) If $n^{1-4\kappa}L_n^{-3} \to \infty$ as $n \to \infty$, then for any $c_2 > 0$, there exist some positive constants $c_3$ and $c_4$ such that

\[ P\Bigl(\max_{1 \le j \le p} |\hat u_{nj} - \tilde u_j| \ge c_2 L_n n^{-2\kappa}\Bigr) \le 12 p_n L_n\bigl\{(2 + L_n)\exp(-c_3 n^{1-4\kappa}L_n^{-3}) + 3L_n\exp(-c_4 L_n^{-3} n)\bigr\}. \tag{31} \]

(ii) If condition (vii) also holds, then by taking $\tau_n = c_5 L_n n^{-2\kappa}$ with $c_5 = c_1\xi/2$, there exist positive constants $c_6$ and $c_7$ such that

\[ P\bigl(\mathcal{M}_* \subset \widehat{\mathcal{M}}_{\tau_n}\bigr) \ge 1 - 12 s_n L_n\bigl\{(2 + L_n)\exp(-c_6 n^{1-4\kappa}L_n^{-3}) + 3L_n\exp(-c_7 L_n^{-3} n)\bigr\}. \tag{32} \]

Remark 2. According to Theorem 1, we can handle NP dimensionality

\[ p = o\bigl(\exp\{n^{1-4\kappa}L_n^{-3}\}\bigr). \]

It shows that the number of spline bases $L_n$ also affects the order of dimensionality: the smaller $L_n$ is, the higher the dimensionality we can handle. On the other hand, Remark 1 points out that $L_n \ge Cn^{2\kappa/(2d+1)}$ is required to have a good bias property. This means that the smoother the underlying function is (i.e., the larger $d$ is), the smaller $L_n$ we can take, and consequently the higher the dimensionality that can be handled. The compatibility of these two requirements requires that $\kappa < (d + 0.5)/(4d + 5)$, which implies that $\kappa < 1/4$. We can take $L_n = O(n^{1/(2d+1)})$, which is the optimal convergence rate for nonparametric regression (Stone, 1982). In this case, the allowable dimensionality can be as high as

\[ p = o\bigl(\exp\{n^{2(d-1)/(2d+1)}\}\bigr). \]
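The rate arithmetic in Remark 2 can be checked with a few lines of code. The sketch below treats $L_n$ as $n^{e}$ for a basis exponent $e$ and evaluates the exponent of $n$ in the bound on $\log p$; the function names are illustrative, and the final check with $\kappa = 0$ is only meant to reproduce the exponent $2(d-1)/(2d+1)$ for the choice $L_n = n^{1/(2d+1)}$.

```python
def kappa_upper_bound(d):
    """Largest kappa compatible with both requirements: (d + 0.5) / (4d + 5)."""
    return (d + 0.5) / (4.0 * d + 5.0)


def min_basis_exponent(kappa, d):
    """Smallest e such that L_n = n**e satisfies L_n >= C n^{2 kappa / (2d + 1)}."""
    return 2.0 * kappa / (2.0 * d + 1.0)


def log_p_exponent(kappa, basis_exponent):
    """Exponent of n in n^{1 - 4 kappa} L_n^{-3} when L_n = n**basis_exponent."""
    return 1.0 - 4.0 * kappa - 3.0 * basis_exponent


if __name__ == "__main__":
    for d in (1.0, 2.0, 4.0):
        bound = kappa_upper_bound(d)
        # At the boundary value of kappa the exponent is zero, so the
        # sure-screening requirement n^{1-4kappa} L_n^{-3} -> infinity just fails.
        boundary = log_p_exponent(bound, min_basis_exponent(bound, d))
        # With L_n = n^{1/(2d+1)} and kappa = 0 the exponent is 2(d-1)/(2d+1).
        optimal = log_p_exponent(0.0, 1.0 / (2.0 * d + 1.0))
        print(d, round(bound, 4), round(boundary, 12), round(optimal, 4))
```

For any $d$ the bound $(d + 0.5)/(4d + 5)$ stays below $1/4$, since $4d + 2 < 4d + 5$, which matches the statement that $\kappa < 1/4$.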
3.2. False Selection Rates

According to (30), the ideal case for a vanishing false-positive rate is when

\[ \max_{j \notin \mathcal{M}_*} \tilde u_j = o(L_n n^{-2\kappa}), \]

so that there is a natural separation between important and unimportant variables. By Theorem 1(i), when the right-hand side of (31) tends to zero, we have with probability tending to 1 that

\[ \max_{j \notin \mathcal{M}_*} \hat u_{nj} \le cL_n n^{-2\kappa}, \quad \text{for any } c > 0. \]

Consequently, by choosing $\tau_n$ as in Theorem 1(ii), NIS can achieve model selection consistency under this ideal situation, i.e.,

\[ P\bigl(\widehat{\mathcal{M}}_{\tau_n} = \mathcal{M}_*\bigr) = 1 - o(1). \]

In particular, this ideal situation occurs under the partial orthogonality condition, i.e., $\{X_j\}_{j \notin \mathcal{M}_*}$ is independent of $\{X_i\}_{i \in \mathcal{M}_*}$ given $W$, which implies $u_j = 0$ for $j \notin \mathcal{M}_*$.

In general, model selection consistency cannot be achieved by a single step of marginal screening. The marginal probes cannot separate important variables from unimportant variables. The following Theorem 2 quantifies how the size of selected models is related to the matrix of basis functions and the thresholding parameter $\tau_n$.

Theorem 2. Under the same conditions as in Theorem 1, for any $\tau_n = c_5 L_n n^{-2\kappa}$, there exist positive constants $c_8$ and $c_9$ such that

\[ P\bigl\{|\widehat{\mathcal{M}}_{\tau_n}| \le O(n^{2\kappa}\lambda_{\max}(\Sigma))\bigr\} \ge 1 - 12 p_n L_n\bigl\{(2 + L_n)\exp(-c_8 n^{1-4\kappa}L_n^{-3}) + 3L_n\exp(-c_9 n L_n^{-3})\bigr\}, \tag{33} \]

where $\Sigma = E[\mathbf{Q}^T \mathbf{Q}]$, and $\mathbf{Q} = (Q_1, \ldots, Q_{p_n})$ is a functional vector of dimension $2p_n L_n$.



4. Iterative Nonparametric Independence Screening 

As Fan and Lv (2008) point out, in practice nonparametric independence screening (NIS) would still suffer from false negatives (i.e., missing some important predictors that are marginally weakly correlated but jointly correlated with the response) and false positives (i.e., selecting some unimportant predictors that are highly correlated with the important ones). Therefore, we adopt an iterative framework to enhance the performance of this method. We repeatedly apply the large-scale variable screening (NIS) followed by a moderate-scale variable selection, where we use the group-SCAD penalty as our selection strategy. In the NIS step, we propose two methods to determine a data-driven threshold for screening, which result in Conditional-INIS and Greedy-INIS, respectively.



4.1. Conditional-INIS Method 

The Conditional-INIS method builds upon conditional random permutation in determining the threshold $\tau_n$. Recall the random permutation used in Fan, Feng and Song (2011), which generalizes that of Zhao and Li (2010). Randomly permute $\mathbf{Y}$ to get $\mathbf{Y}_\pi = (Y_{\pi_1}, \ldots, Y_{\pi_n})^T$ and compute $\hat u_{nj}^{\pi}$, where $\pi$ is a permutation of $\{1, \ldots, n\}$, based on the randomly coupled data $\{(Y_{\pi_i}, W_i, \mathbf{X}_i)\}_{i=1}^{n}$ that has no relationship between covariates and response. Thus, these estimates serve as the baseline of the marginal utilities under the null model (no relationship). To control the false selection rate at $q/p$ under the null model, one would choose the screening threshold to be $\tau_q$, the $q$th-ranked magnitude of $\{\hat u_{nj}^{\pi}, j = 1, \ldots, p\}$. Thus, the NIS step selects variables $\{j : \hat u_{nj} \ge \tau_q\}$. In practice, one frequently uses $q = 1$, namely, the largest marginal utility under the null model.
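The unconditional permutation threshold described above can be sketched in a few lines. The code below is only an illustration: it accepts any marginal utility as a callable, and the default used in the example, the squared sample correlation, is a simple stand-in chosen to keep the sketch self-contained, not the spline-based utility of Section 2.2. The function names and the seed handling are assumptions.

```python
import random


def squared_correlation(y, x):
    """Stand-in marginal utility: squared sample correlation between y and x."""
    n = len(y)
    mean_y = sum(y) / n
    mean_x = sum(x) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def permutation_threshold(y, columns, utility, q=1, seed=0):
    """q-th largest marginal utility computed on a randomly permuted response."""
    rng = random.Random(seed)
    y_perm = list(y)
    rng.shuffle(y_perm)  # decouples the response from every covariate
    null_utilities = sorted((utility(y_perm, col) for col in columns), reverse=True)
    return null_utilities[q - 1]


def screen(y, columns, utility, q=1, seed=0):
    """Select the covariates whose utility reaches the permutation threshold."""
    tau = permutation_threshold(y, columns, utility, q, seed)
    selected = [j for j, col in enumerate(columns) if utility(y, col) >= tau]
    return selected, tau


if __name__ == "__main__":
    rng = random.Random(1)
    n, p = 200, 30
    columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    y = [2.0 * columns[0][i] + rng.gauss(0, 1) for i in range(n)]
    selected, tau = screen(y, columns, squared_correlation)
    print("threshold:", round(tau, 4), "selected:", selected)
```

With $q = 1$ the threshold is the largest null utility, so a covariate is kept only if its observed utility is at least as large as anything seen under the permuted response.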

When the correlations among covariates are large, there is hardly any separation between the marginal utilities of the true variables and the false ones. This makes the initially selected variable set very large and makes it hard to carry out the remaining iterations with limited false positives. For numerical illustrations, see Section 5.2. Therefore, we propose a conditional permutation method to tackle this problem. Combining the other steps, our Conditional-INIS algorithm proceeds as follows.
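0. For $j = 1, \ldots, p$, compute

\[ \hat u_{nj} = \|\hat a_{nj}(\mathbf{W}) + \hat b_{nj}(\mathbf{W})\mathbf{X}_j\|_n^2 - \|\hat a_{n0}(\mathbf{W})\|_n^2, \]

where the estimates are defined in (9) and (10) using $\{(\mathbf{Y}, \mathbf{W}, \mathbf{X}_j), j = 1, \ldots, p\}$. Select the top $K$ variables by ranking their marginal utilities $\hat u_{nj}$, resulting in the index subset $\mathcal{M}_0$ to condition upon.

1. Regress $\mathbf{Y}$ on $\{(\mathbf{W}, \mathbf{X}_j), j \in \mathcal{M}_0\}$, and get the intercept $\hat\beta_{n0}(\mathbf{W})$ and the functional coefficients' estimators $\{\hat\beta_{nj}(\mathbf{W}), j \in \mathcal{M}_0\}$. Conditioning on $\mathcal{M}_0$, the $n$-dimensional partial residual is

\[ \mathbf{Y}^* = \mathbf{Y} - \hat\beta_{n0}(\mathbf{W}) - \sum_{j \in \mathcal{M}_0} \mathbf{X}_j\hat\beta_{nj}(\mathbf{W}). \]

For all $j \in \mathcal{M}_0^c$, compute $\hat u_{nj}^*$ using $\{(\mathbf{Y}^*, \mathbf{W}, \mathbf{X}_j), j \in \mathcal{M}_0^c\}$, which measures the additional utility of each covariate conditioning on the selected set $\mathcal{M}_0$.

To determine the threshold for NIS, we apply random permutation on the partial residual $\mathbf{Y}^*$, which yields $\mathbf{Y}_\pi^*$. Compute $\hat u_{nj}^{*\pi}$ based on the decoupled data $\{(\mathbf{Y}_\pi^*, \mathbf{W}, \mathbf{X}_j), j \in \mathcal{M}_0^c\}$. Let $\tau_q^*$ be the $q$th-ranked magnitude of $\{\hat u_{nj}^{*\pi}, j \in \mathcal{M}_0^c\}$. Then, the active variable set is chosen as

\[ \mathcal{A}_1 = \{j : \hat u_{nj}^* \ge \tau_q^*, j \in \mathcal{M}_0^c\} \cup \mathcal{M}_0. \]

In our numerical studies, $q = 1$.

2. Apply the group-SCAD penalty on $\mathcal{A}_1$ to select a subset of variables $\mathcal{M}_1$. Details about the implementation of SCAD will be described later.

3. Repeat steps 1-2, where we replace $\mathcal{M}_0$ in step 1 by $\mathcal{M}_l$, $l = 1, 2, \ldots$, and get $\mathcal{A}_{l+1}$ and $\mathcal{M}_{l+1}$ in step 2. Iterate until $\mathcal{M}_{l+1} = \mathcal{M}_k$ for some $k \le l$ or $|\mathcal{M}_{l+1}| \ge \zeta_n$, for some prescribed positive integer $\zeta_n$.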



4.2. Greedy-INIS Method

Following Fan, Feng and Song (2011), we also implemented a greedy version of the INIS method. We skip step 0 and start from step 1 in the algorithm above (i.e., take $\mathcal{M}_0 = \emptyset$), and select the top $p_0$ variables that have the largest marginal norms $\hat u_{nj}$. This NIS step is followed by the same group-SCAD penalized regression as in step 2. We then iterate these steps until there are two identical subsets or the number of variables selected exceeds a prespecified $\zeta_n$. In our simulation studies, $p_0$ is set as 1.



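4.3. Implementation of SCAD

In the group-SCAD step, variables are selected as $\mathcal{M}_l = \{j \in \mathcal{A}_l : \hat\gamma_j \ne 0\}$ through minimizing the following objective function:

\[ \frac{1}{n}\sum_{i=1}^{n}\Bigl(Y_i - \sum_{j \in \mathcal{A}_l}\sum_{k=1}^{L_n} B_{jk}(W_i)\gamma_{jk}X_{ji}\Bigr)^2 + \sum_{j \in \mathcal{A}_l} p_\lambda(\|\gamma_j\|_B), \tag{34} \]

where $\|\gamma_j\|_B = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\sum_{k=1}^{L_n} B_{jk}(W_i)\gamma_{jk}\bigr)^2}$, and $p_\lambda(\cdot)$ is the SCAD penalty such that

\[ p'_\lambda(|x|) = \lambda I(|x| \le \lambda) + \frac{(a\lambda - |x|)_+}{a - 1} I(|x| > \lambda), \]

with $p_\lambda(0) = 0$. We set $a = 3.7$ as suggested and solve the optimization above via local quadratic approximations (Fan and Li, 2001). $\lambda$ is chosen by the BIC criterion $n\log(\hat\sigma^2) + kL_n\log n$, where $k$ is the number of covariates chosen. By Antoniadis and Fan (2001) and Yuan and Lin (2006), the norm-penalty in (34) encourages group selection.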
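The SCAD penalty is specified through its derivative, and integrating that derivative from 0 with $p_\lambda(0) = 0$ gives a three-piece closed form. The sketch below implements both; the function names are illustrative, and the default $a = 3.7$ follows the value stated in the text.

```python
def scad_derivative(x, lam, a=3.7):
    """Derivative of the SCAD penalty evaluated at |x|."""
    t = abs(x)
    if t <= lam:
        return lam
    # Linearly decaying part, clipped at zero once |x| exceeds a * lam.
    return max(a * lam - t, 0.0) / (a - 1.0)


def scad_penalty(x, lam, a=3.7):
    """SCAD penalty obtained by integrating scad_derivative from 0 to |x|."""
    t = abs(x)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2.0 * a * lam * t - t * t - lam * lam) / (2.0 * (a - 1.0))
    # Constant beyond a * lam: large coefficients are not penalized further.
    return lam * lam * (a + 1.0) / 2.0


if __name__ == "__main__":
    for value in (0.0, 0.5, 1.0, 2.0, 3.7, 10.0):
        print(value, scad_penalty(value, 1.0), scad_derivative(value, 1.0))
```

In the group setting of this section the argument would be the group norm $\|\gamma_j\|_B$ rather than a scalar coefficient; the flat tail beyond $a\lambda$ is what keeps large groups from being shrunk.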



5. Numerical Studies 
In this section, we carry out several simulation studies to assess the performance of our proposed methods. Unless otherwise stated, the common setup for the following simulations is: cubic B-splines, $L_n = 7$, sample size $n = 400$, number of variables $p = 1000$, and number of simulations $N = 200$ for each example.



5.1. Comparison of Minimum Model Size

In this study, as in Fan and Song (2010), we illustrate the performance of the NIS method in terms of the minimum model size (MMS) needed to include all the true variables, i.e., to possess the sure screening property.
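The minimum model size is straightforward to compute once each covariate has a screening utility. The sketch below ranks covariates by decreasing utility and returns the worst rank attained by a true variable; the function name and the toy numbers are illustrative only.

```python
def minimum_model_size(utilities, true_indices):
    """Smallest top-k model, ranked by decreasing utility, containing all true variables."""
    order = sorted(range(len(utilities)), key=lambda j: utilities[j], reverse=True)
    rank = {j: position + 1 for position, j in enumerate(order)}
    return max(rank[j] for j in true_indices)


if __name__ == "__main__":
    # Toy utilities for four covariates; covariates 0 and 2 are the true ones.
    toy_utilities = [0.9, 0.1, 0.5, 0.7]
    print(minimum_model_size(toy_utilities, [0, 2]))
```

In the toy call, covariate 3 outranks the true covariate 2, so three variables must be kept before both true variables are included.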



Example 1 Following Fan and Song, we first consider a linear model as a special case of the varying-coefficient model. Let $\{X_k\}_{k=1}^{950}$ be i.i.d. standard normal random variables and

\[ X_k = \sum_{j=1}^{s} (-1)^{j+1}X_j/5 + \sqrt{1 - s/25}\,\varepsilon_k, \qquad k = 951, \ldots, 1000, \]

where $\{\varepsilon_k\}_{k=951}^{1000}$ are standard normal random variables. We construct the following model: $Y = \beta^T\mathbf{X} + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sqrt{3})$ and $\beta = (1, -1, 1, -1, \ldots)^T$ has $s$ nonzero components. To carry out NIS, we generate an exposure variable $W$ independently from the standard uniform distribution.

We compare NIS, Lasso and SIS (independence screening for linear models). The boxplots of minimum model size are presented in Figure 1. Note that when $s > 5$, the irrepresentable condition fails, and Lasso performs badly even in terms of pure screening. On the other hand, SIS performs better than NIS because the coefficients are indeed constant, and there are fewer parameters involved in SIS ($p$) than in NIS ($pL_n$).



Example 2 For the second example, we illustrate that when the underlying model's coefficients are indeed varying, we do need nonparametric independence screening. Let $\{U_1, \ldots\}$ be i.i.d. uniform random variables on $[0, 1]$, based on which we construct $\mathbf{X}$ and $W$ as follows:

\[ X_j = \frac{\cdots}{1 + t_1}, \qquad W = \frac{\cdots}{1 + t_2}, \]

where $t_1$ and $t_2$ control the correlation among the covariates $\mathbf{X}$ and the correlation between $\mathbf{X}$ and $W$, respectively. When $t_1 = 0$, the $X_j$'s are uncorrelated, and when $t_1 = 1$ the correlation is 0.5. If $t_1 = t_2 = 1$, the $X_j$'s and $W$ are also correlated with correlation coefficient 0.5.

For the varying coefficients part, we take coefficient functions

\[ \beta_1(W) = W, \qquad \beta_2(W) = (2W - 1)^2, \qquad \beta_3(W) = \sin(2\pi W). \]

The true data generation model is

\[ Y = 5\beta_1(W) \cdot X_1 + 3\beta_2(W) \cdot X_2 + 4\beta_3(W) \cdot X_3 + \varepsilon, \]

where the $\varepsilon$'s are i.i.d. standard Gaussian random variables.

Under different correlation settings, the comparison of MMS between the NIS and SIS methods is presented in Figure 2. When the correlation gets stronger, independence screening becomes harder.



5.2. Comparison of Permutation and Conditional Permutation 

In this section, we illustrate the performance of the conditional random permutation method.

Fig. 1. Boxplots of minimum model sizes (left to right: NIS, Lasso and SIS) for Example 1 under different true models.



Example 3 Let $\{Z_1, \ldots, Z_p\}$ be i.i.d. standard normal, $\{U_1, U_2\}$ be i.i.d. standard uniformly distributed random variables, and let the noise $\varepsilon$ follow the standard normal distribution. We construct $\{W, \mathbf{X}\}$ and $Y$ as follows:

\[ \cdots \; 2 - \sin(2\pi W) \; \cdots \]

We take $t_1 = t_2 = 0$, resulting in the uncorrelated case, and $t_1 = 3$ and $t_2 = 1$, corresponding to $\mathrm{corr}(X_j, X_k) = 0.43$ for all $j \ne k$ and $\mathrm{corr}(X_j, W) = 0.46$. By taking $q = 1$ (i.e., taking the maximum value of the marginal utility of the permuted estimates), we report the average of the true positive number (TP), the model size, the lower bound of the marginal signal of true variables and the upper bound of the marginal signal of false variables for different correlation settings based on 200 simulations. Their robust standard deviations are also reported.

Based on Table 1, we see that when the correlation gets stronger, although sure screening properties can be achieved most of the time via unconditional ($K = 0$) random permutation thresholding, the model size becomes very large and therefore the false selection rate is high. The reason is that there is no separation between the marginal signals of the true variables and the false ones. This drawback makes the original random permutation an infeasible method for determining the screening threshold in practice.

We now apply the conditional permutation method, whose performance is illustrated in Table 1 for a few choices of the tuning parameter $K$. The screening threshold is taken as $\tau_q$ with $q = 1$. Generally speaking, although the lower bound of the true positives' signals may be smaller than the upper bound of the false variables' signals, the largest $K$ norms still have a high possibility of containing at least some true variables. When conditioning on this small set of more relevant variables, the marginal contributions of false positives get weaker. Note that in the absence of correlation, when $K \ge s$ (here $s = 4$), the first $K$ variables have already included all the true variables (i.e., $\mathcal{M}_* \setminus \mathcal{M}_0 = \emptyset$), hence the minimum of the true signal is not available. In other cases, we see that the gap between the marginal signals of true variables and false variables becomes large enough to differentiate them. Table 1 shows that by using the thresholding via the conditional permutation method, not only are the sure screening properties still maintained, but the model sizes are also dramatically reduced.

Fig. 2. Boxplots of minimum model sizes (left: NIS, right: SIS) for Example 2 under different correlation settings.

Table 1. Model size and marginal signals under different correlation settings (Example 3)

Model           TP        Size             min û*_nj        max û*_nj     max
                                           j ∈ M_*\M_0                    j ∈ {1,...,p}\M_0

t1=0, t2=0      4.00(0)   6.68(2.99)       2.96(0.72)       1.22(0.18)    1.12(0.15)
t1=3, t2=1      4.00(0)   886.49(88.81)    0.61(0.10)       0.58(0.07)    0.22(0.03)

t1=0, t2=0      4.00(0)   5.70(1.49)       2.83(0.57)       0.75(0.10)    0.72(0.11)
t1=3, t2=1      4.00(0)   202.50(154.85)   0.28(0.06)       0.20(0.03)    0.11(0.02)

t1=0, t2=0      4.00(0)   5.14(1.49)       NA               0.06(0.01)    0.06(0.01)
t1=3, t2=1      4.00(0)   4.98(0.75)       0.16(0.05)       0.05(0.01)    0.06(0.01)

t1=0, t2=0      4.00(0)   8.92(0.75)       NA               0.05(0.01)    0.05(0.01)
t1=3, t2=1      3.99(0)   8.43(0.75)       0.11(0.03)       0.04(0.01)    0.05(0.01)



5.3. Comparison of Model Selection and Estimation 

In this section we explore the performance of the Conditional-INIS and Greedy-INIS methods. In our iterative framework, conditional permutation serves as the initialization step (step 0) and we take $K = 5$ in the rest of the paper. For each method, we report the average number of true positives (TP), false positives (FP), the prediction error (PE), and their robust standard deviations. Here the prediction error is the mean squared error calculated on a test dataset of size $n/2 = 200$ generated from the same model. As a measure of the complexity of the model, the signal-to-noise ratio (SNR), defined by $\mathrm{var}(\beta^T(W)\mathbf{X})/\mathrm{var}(\varepsilon)$, is computed. Table 2 reports the results using the simulated model specified in Example 3. We now illustrate the performance by using another example.
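The summary measures reported in this section can be computed as sketched below. The true and false positive counts and the prediction error follow directly from the definitions in the text. The robust standard deviation is an assumption: it is not defined in this excerpt, and the sketch uses the common convention of the interquartile range divided by 1.34, with quartiles obtained by linear interpolation. All function names are illustrative.

```python
def true_false_positives(selected, truth):
    """Count selected indices that are truly active (TP) and those that are not (FP)."""
    selected_set, truth_set = set(selected), set(truth)
    return len(selected_set & truth_set), len(selected_set - truth_set)


def prediction_error(y_true, y_pred):
    """Mean squared error on a test sample."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def quantile(values, prob):
    """Sample quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    position = prob * (len(xs) - 1)
    lower = int(position)
    upper = min(lower + 1, len(xs) - 1)
    return xs[lower] + (position - lower) * (xs[upper] - xs[lower])


def robust_sd(values):
    """Assumed convention: interquartile range divided by 1.34."""
    return (quantile(values, 0.75) - quantile(values, 0.25)) / 1.34


if __name__ == "__main__":
    print(true_false_positives([1, 2, 7], [1, 2, 3, 4]))
    print(prediction_error([1.0, 2.0], [1.0, 4.0]))
    print(robust_sd([1, 2, 3, 4, 5]))
```

Under this convention, a summary whose interquartile range is zero across replications, such as a true positive count that is almost always the same, has a robust standard deviation of 0.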
Table 2. Average values of the number of true positives (TP), false positives (FP), and prediction error (PE) for the simulated model in Example 3. Robust standard deviations are given in parentheses.

Model            Correlation        Conditional-INIS           Greedy-INIS
                 X's     X's-W      TP      FP      PE         TP      FP      PE

t1=0, t2=0       0       0          4       0.54    1.10       4       13.01   1.41
(SNR ≈ 16.85)                       (0)     (0.75)  (0.05)     (0)     (3.73)  (0.17)

t1=2, t2=0       0.25    0          4       0.20    0.78       4       0.41    1.10
(SNR ≈ 3.66)                        (0)     (0)     (0.06)     (0)     (0)     (0.05)

t1=2, t2=1       0.25    0.36       3.97    0.26    1.27       3.90    0.14    1.63
(SNR ≈ 3.21)                        (0)     (0)     (0.24)     (0)     (0)     (0.41)

t1=3, t2=0       0.43    0          4       0.19    1.03       3.99    0.57    1.22
(SNR ≈ 3.32)                        (0)     (0)     (0.06)     (0)     (0)     (0.07)

t1=3, t2=1       0.43    0.46       3.95    0.31    1.30       3.77    0.27    1.29
(SNR ≈ 2.81)                        (0)     (0.75)  (0.12)     (0)     (0)     (0.17)

Table 3. Average values of the number of true positives (TP), false positives (FP),
and prediction error (PE) for the model in Example 4. Robust standard deviations are
given in parentheses.

Model              Correlation        Conditional-INIS              Greedy-INIS
                   X's     X's-W      TP      FP       PE           TP      FP       PE
t1 = 0, t2 = 0     0       0          8       0.21     1.24         8       10.71    1.57
(SNR ≈ 47.68)                         (0)     (0)      (0.09)       (0)     (3.73)   (0.20)
t1 = 2, t2 = 0     0.25    0          8       0.13     1.17         8       0.60     1.16
(SNR ≈ 9.40)                          (0)     (0)      (0.09)       (0)     (0)      (0.10)
t1 = 2, t2 = 1     0.25    0.36       7.80    0.20     2.16         7.55    0.26     2.26
(SNR ≈ 8.62)                          (0)     (0)      (0.58)       (0.75)  (0)      (0.70)
t1 = 3, t2 = 0     0.43    0          7.90    0.10     1.21         7.98    0.71     1.29
(SNR ≈ 8.18)                          (0)     (0)      (0.12)       (0)     (0)      (0.10)
t1 = 3, t2 = 1     0.43    0.46       7.75    0.18     1.65         7.35    0.28     1.84
(SNR ≈ 7.61)                          (0)     (0)      (0.26)       (0.75)  (0)      (0.42)


Example 4. Let $\{W, X\}$, $Y$ and $\varepsilon$ be the same as in Example 3. We now introduce more
complexity through the following model:

\[
Y = 3W \cdot X_1 + (W + 1)^2 \cdot X_2 + (W - 2)^3 \cdot X_3 + 3\sin(2\pi W) \cdot X_4
    + \exp(W) \cdot X_5 + 2 \cdot X_6 + 2 \cdot X_7 + 3\sqrt{W} \cdot X_8 + \varepsilon.
\]

The results are presented in Table 3.

In the examples above, Conditional-INIS and Greedy-INIS show comparable performance in
terms of TP, FP and PE. When the covariates are independent or weakly correlated, sure screening is
easier to achieve and false positives are rare; as the correlation gets stronger, we see a decrease in
TP and an increase in FP. Greedy-INIS appears to select slightly more false positives than
Conditional-INIS. The reason is that in each step Greedy-INIS selects the top variable(s) by fitting
the residuals conditional on the previously chosen variable set, and so tends to overfit. However, the
coefficient estimates for these false positives are fairly small, hence they do not affect the prediction
error very much. Regarding computational efficiency, Conditional-INIS performs better in our simulated
examples, as it usually requires only two to three iterations, while Greedy-INIS needs at least
$s/p_0$ iterations (here $p_0 = 1$, and $s = 4$ and $8$, respectively, for Examples 3 and 4).
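The summary statistics reported in Tables 2 and 3 can be computed along the following lines. This is a minimal sketch, not the authors' code: the function names are hypothetical, and the robust standard deviation is assumed here to be the interquartile range divided by 1.34, a common convention that this section does not spell out.

```python
import statistics


def tp_fp(selected, true_set):
    """Count true and false positives of a selected index set."""
    selected, true_set = set(selected), set(true_set)
    tp = len(selected & true_set)   # selected variables that are truly active
    fp = len(selected - true_set)   # selected variables that are not active
    return tp, fp


def prediction_error(y_test, y_pred):
    """Mean squared prediction error on a held-out test set."""
    return sum((y - p) ** 2 for y, p in zip(y_test, y_pred)) / len(y_test)


def robust_sd(values):
    """Assumed convention: IQR / 1.34 (not stated in this section)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / 1.34


# Toy usage: the true model is {1, 2, 3, 4}; one replication selects {1, 2, 3, 4, 17}.
tp, fp = tp_fp([1, 2, 3, 4, 17], [1, 2, 3, 4])
pe = prediction_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```

Averaging `tp`, `fp` and `pe` over replications, and applying `robust_sd` to the per-replication values, gives entries of the same kind as those in the tables.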



5.4. Real Data Analysis on Boston Housing Data 

In this section we illustrate the performance of our method through a real data analysis of the Boston
Housing Data (Harrison and Rubinfeld, 1978). This dataset contains housing data for 506 census tracts
of Boston from the 1970 census. Most empirical results for the housing value equation are based on a
common specification (Harrison and Rubinfeld, 1978),

\[
\begin{aligned}
\log(\mathrm{MV}) ={}& \beta_0 + \beta_1 \mathrm{RM}^2 + \beta_2 \mathrm{AGE} + \beta_3 \log(\mathrm{DIS})
   + \beta_4 \log(\mathrm{RAD}) + \beta_5 \mathrm{TAX} \\
 & + \beta_6 \mathrm{PTRATIO} + \beta_7 (\mathrm{B} - 0.63)^2 + \beta_8 \log(\mathrm{LSTAT}) + \beta_9 \mathrm{CRIM} \\
 & + \beta_{10} \mathrm{ZN} + \beta_{11} \mathrm{INDUS} + \beta_{12} \mathrm{CHAS} + \beta_{13} \mathrm{NOX}^2 + \varepsilon,
\end{aligned}
\]

where the dependent variable MV is the median value of owner-occupied homes, and the independent
variables are quantified measurements of the neighborhood, whose descriptions can be found in the
manual of the R package mlbench. The common specification uses $\mathrm{RM}^2$ and $\mathrm{NOX}^2$ to
get a better fit, and for comparison we take these transformed variables as our input variables.

Table 4. Prediction error (PE), model size and selected noise variables (SNV) over 100 repetitions
and their robust standard deviations (in parentheses) for Conditional-INIS (p = 1000),
Greedy-INIS (p = 1000), and SCAD fit (p = 12).

Method                          PE               Size            SNV
Conditional-INIS (p = 1000)     0.046 (0.020)    5.55 (0.75)     0 (0)
Greedy-INIS (p = 1000)          0.048 (0.020)    4.80 (1.49)     0.01 (0)
SCAD fit (p = 12)               0.052 (0.019)    6.05 (1.87)     NA

To exploit the power of the varying-coefficient model, we take the variable log(DIS), the weighted
distances to five employment centers in the Boston region, as the exposure variable. This allows us to
examine how the distance to the business hubs interacts with other variables. It is reasonable to
assume that the impact of other variables on housing price varies with this distance, which is an
important characteristic of the neighborhood, namely the geographical accessibility to employment.
Interestingly, Conditional-INIS selects the following submodel:

\[
\log(\mathrm{MV}) = \beta_0(W) + \beta_1(W) \cdot \mathrm{RM}^2 + \beta_2(W) \cdot \mathrm{AGE} + \beta_5(W) \cdot \mathrm{TAX}
   + \beta_7(W) \cdot (\mathrm{B} - 0.63)^2 + \beta_9(W) \cdot \mathrm{CRIM} + \varepsilon, \tag{35}
\]

where $W = \log(\mathrm{DIS})$. The estimated functions $\hat\beta_j(W)$ are presented in Figure 3. This
varying-coefficient model reveals interesting aspects of housing valuation, and nonlinear interactions
with accessibility are clearly evident. For example, RM is the average number of rooms in owner units,
which represents the size of a house. The marginal cost of a big house is higher in employment
centers, where population is concentrated and the supply of mansions is limited. The cost per room
decreases as one moves away from the business centers and then gradually increases. CRIM is the crime
rate in each township, which usually has a negative impact, and from its varying coefficient we see
that it is a bigger concern near (demographically more complex) business centers. AGE is the proportion
of owner units built prior to 1940, and its varying coefficient has a parabolic shape: a positive impact
on housing values near employment centers and in suburban areas, and a negative effect in between. NOX
(air pollution level) generally has a negative impact, and the impact is larger when the house is near
employment centers, where the air is presumably more polluted than in suburban areas.

We now evaluate the performance of our INIS method in a high-dimensional setting. To accomplish
this, let $\{Z_1, \dots, Z_p\}$ be i.i.d. standard normal random variables and let $U$ follow the
standard uniform distribution. We then expand the data set by adding the artificial predictors

\[
X_j = \frac{Z_j + tU}{1 + t}, \qquad j = s + 1, \dots, p.
\]

Note that $\{W, X_1, \dots, X_s\}$ are the independent variables in the original data set ($s = 13$
here), and the variables $\{X_j\}_{j=s+1}^{p}$ are known to be irrelevant to the housing price, though
the maximum spurious correlation of these 987 artificial predictors with the housing price is not small.
We take $p = 1000$, $t = 2$, randomly select $n = 406$ samples as the training set, and compute the
prediction mean squared error (PE) on the remaining 100 samples. As a benchmark for comparison, we also
fit a regression on $\{W, X_1, \dots, X_s\}$ directly using the SCAD penalty without the screening
procedure. We repeat this $N = 100$ times and report the average prediction error and model size, and
their robust standard deviations. Since $\{X_j\}_{j=s+1}^{p}$ are artificial variables, we also include
the number of artificial variables selected by each method as a proxy for false positives. The results
are presented in Table 4.
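The construction of the artificial predictors and the random training/test split can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, and $U$ is assumed to be drawn once per observation and shared across all $j$, which is what induces correlation among the noise columns.

```python
import random


def add_artificial_predictors(n, p, s, t, seed=0):
    """Return an n x (p - s) list of noise columns X_j = (Z_j + t*U) / (1 + t)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random()  # U ~ Uniform(0, 1); assumed shared by all columns in a row
        rows.append([(rng.gauss(0.0, 1.0) + t * u) / (1.0 + t)
                     for _ in range(p - s)])
    return rows


def train_test_split(n, n_train, seed=0):
    """Randomly split indices 0..n-1 into a training part and a test part."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]


# Settings quoted in the text: 506 tracts, p = 1000, s = 13, t = 2, n = 406.
noise = add_artificial_predictors(n=506, p=1000, s=13, t=2)
train_idx, test_idx = train_test_split(506, 406)
```

Repeating the split with different seeds gives the $N = 100$ repetitions described above.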

As seen from Table 4, our methods are very effective in filtering noise variables in a
high-dimensional setting, and achieve a prediction error comparable to what would be obtained if the
noise were absent. In conclusion, the proposed INIS methodology is very useful in high-dimensional
scientific discoveries: it can select a parsimonious, close-to-truth model and reveal interesting
relationships between variables, as illustrated in this section.

Fig. 3. Fitted functional estimates $\hat\beta_j(W)$ selected by Conditional-INIS.

Appendix

A.1. Properties of B-splines 

Our estimation uses the B-spline basis, which has the following properties (de Boor, 1978): for each
$j = 1, \dots, p$ and $k = 1, \dots, L_n$, $B_k(W) \ge 0$ and $\sum_{k=1}^{L_n} B_k(W) = 1$ for
$W \in \mathcal{W}$. In addition, there exist positive constants $T_3$ and $T_4$ such that for any
$\eta_k \in \mathbb{R}$, $k = 1, \dots, L_n$,

\[
L_n^{-1} T_3 \sum_{k=1}^{L_n} \eta_k^2 \le \int \Big( \sum_{k=1}^{L_n} \eta_k B_k(w) \Big)^2 dw
  \le L_n^{-1} T_4 \sum_{k=1}^{L_n} \eta_k^2. \tag{36}
\]

Then under condition (iii), there exist positive constants $C_1$ and $C_2$ such that for $k = 1, \dots, L_n$,

\[
C_1 L_n^{-1} \le E[B_k^2(W)] \le C_2 L_n^{-1}, \tag{37}
\]

where $C_1 = T_1 T_3$ and $C_2 = T_2 T_4$.

Furthermore, under condition (iii), it follows from (36) that for any
$\boldsymbol\eta = (\eta_1, \dots, \eta_{L_n})^T \in \mathbb{R}^{L_n}$ such that $\|\boldsymbol\eta\|_2^2 = 1$,

\[
C_1 L_n^{-1} \le \boldsymbol\eta^T E[B^T B] \boldsymbol\eta \le C_2 L_n^{-1}.
\]

Or equivalently,

\[
C_1 L_n^{-1} \le \lambda_{\min}(E[B^T B]) \le \lambda_{\max}(E[B^T B]) \le C_2 L_n^{-1}. \tag{38}
\]
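The nonnegativity and partition-of-unity properties stated above can be checked numerically with the Cox-de Boor recursion. The sketch below is illustrative only: the clamped cubic knot vector on [0, 1] and the function name `bspline_basis` are assumptions made for the example, not choices taken from the paper.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x by the Cox-de Boor recursion."""
    # Degree-0 functions: indicators of the half-open knot intervals.
    n0 = len(knots) - 1
    vals = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(n0)]
    for d in range(1, degree + 1):
        new = []
        for i in range(n0 - d):
            left = right = 0.0
            # Terms with a zero-length knot span are dropped (0/0 := 0).
            if knots[i + d] > knots[i]:
                left = (x - knots[i]) / (knots[i + d] - knots[i]) * vals[i]
            if knots[i + d + 1] > knots[i + 1]:
                right = ((knots[i + d + 1] - x)
                         / (knots[i + d + 1] - knots[i + 1]) * vals[i + 1])
            new.append(left + right)
        vals = new
    return vals


# Assumed setup: clamped cubic knots on [0, 1], giving L_n = 7 basis functions.
knots = [0.0] * 4 + [0.25, 0.5, 0.75] + [1.0] * 4
grid = [i / 50 for i in range(50)]          # points in [0, 1)
basis_on_grid = [bspline_basis(x, knots, 3) for x in grid]
sums = [sum(b) for b in basis_on_grid]
```

On this grid every basis value is nonnegative and each entry of `sums` equals 1 up to rounding, which is the property $\sum_{k} B_k(W) = 1$ used throughout the appendix.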
A.2. Technical Lemmas

Some technical lemmas needed for our main results are stated below. Lemma 1 and Lemma 2 give
some characterizations of exponential tails, which become handy in our proofs. Lemma 3 and Lemma 4
are Bernstein-type inequalities.

Lemma 1. Let $X$, $W$ be random variables. Suppose $X$ has a conditional exponential tail:
$P(|X| > t \mid W) \le \exp(1 - (t/K)^r)$ for all $t \ge 0$ and uniformly on the compact support of $W$,
where $K > 0$ and $r \ge 1$. Then for all $m \ge 2$,

\[
E(|X|^m \mid W) \le e m K^m m!. \tag{39}
\]

Proof. Recall that for any non-negative random variable $Z$, $E[Z \mid W] = \int_0^\infty P\{Z > t \mid W\}\,dt$.
Then we have

\[
\begin{aligned}
E(|X|^m \mid W) &= \int_0^\infty P\{|X|^m > t \mid W\}\,dt \\
 &\le \int_0^\infty \exp\big(1 - (t^{1/m}/K)^r\big)\,dt \\
 &= \frac{e m K^m}{r}\,\Gamma\Big(\frac{m}{r}\Big).
\end{aligned}
\]

The lemma follows from the fact that $r \ge 1$. □
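The last step of the proof of Lemma 1 can be checked numerically: the closed form $(e m K^m / r)\,\Gamma(m/r)$ of the tail integral should never exceed the stated bound $e m K^m m!$ when $r \ge 1$. The grid of values for $m$, $K$ and $r$ below is an arbitrary choice made for illustration, and the function names are hypothetical.

```python
import math


def moment_integral(m, K, r):
    """Closed form of the integral in the proof: (e*m*K^m / r) * Gamma(m / r)."""
    return math.e * m * K ** m / r * math.gamma(m / r)


def lemma1_bound(m, K):
    """Right-hand side of (39): e * m * K^m * m!."""
    return math.e * m * K ** m * math.factorial(m)


# Compare the two quantities on an illustrative grid with r >= 1 and m >= 2.
checks = [(m, K, r, moment_integral(m, K, r) <= lemma1_bound(m, K))
          for m in range(2, 9)
          for K in (0.5, 1.0, 2.0)
          for r in (1.0, 1.5, 2.0, 4.0)]
all_hold = all(ok for _, _, _, ok in checks)
```

The comparison holds because $\Gamma(m/r)/r = \Gamma(m/r + 1)/m \le m!/m$ for $m \ge 2$ and $r \ge 1$.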



Lemma 2. Let $Z_1$, $Z_2$ and $W$ be random variables. Suppose that there exist $K_1, K_2 > 0$ and
$r_1, r_2 \ge 1$ such that $r_1 r_2/(r_1 + r_2) \ge 1$, and

\[
P(|Z_i| > t \mid W) \le \exp\big(1 - (t/K_i)^{r_i}\big), \qquad i = 1, 2,
\]

for all $t \ge 0$ and uniformly on $\mathcal{W}$. Then for some $r^* \ge 1$ and $K^* > 0$,

\[
P(|Z_1 Z_2| > t \mid W) \le \exp\big(1 - (t/K^*)^{r^*}\big) \tag{40}
\]

for all $t \ge 0$ and uniformly on $\mathcal{W}$.

Proof. For any $t > 0$, let $M = (t K_2^{r_2/r_1}/K_1)^{r_1/(r_1 + r_2)}$ and $r = r_1 r_2/(r_1 + r_2)$.
Then uniformly on $\mathcal{W}$, we have

\[
\begin{aligned}
P(|Z_1 Z_2| > t \mid W) &\le P(M|Z_1| > t \mid W) + P(|Z_2| > M \mid W) \\
 &\le \exp\{1 - (t/(K_1 M))^{r_1}\} + \exp\{1 - (M/K_2)^{r_2}\} \\
 &= 2\exp\{1 - (t/(K_1 K_2))^{r}\}.
\end{aligned}
\]

Let $r^* \in [1, r]$ and $K^* = \max\{(r^*/r)^{1/r} K_1 K_2,\ (1 + \log 2)^{1/r} K_1 K_2\}$. It can be shown
that $G(t) = (t/(K_1 K_2))^{r} - (t/K^*)^{r^*}$ is increasing when $t > K^*$. Hence
$G(t) \ge G(K^*) \ge \log 2$ when $t > K^*$, which implies that when $t > K^*$,

\[
P(|Z_1 Z_2| > t \mid W) \le 2\exp\{1 - (t/(K_1 K_2))^{r}\} \le \exp\{1 - (t/K^*)^{r^*}\}.
\]

On the other hand, when $t \le K^*$,

\[
P(|Z_1 Z_2| > t \mid W) \le 1 \le \exp\{1 - (t/K^*)^{r^*}\}.
\]

Lemma 2 holds. □

Lemma 3. (Bernstein inequality, Lemma 2.2.11, van der Vaart and Wellner (1996)). For independent
random variables $Y_1, \dots, Y_n$ with mean zero such that $E[|Y_i|^m] \le m!\,M^{m-2} v_i/2$ for every
$m \ge 2$ (and all $i$) and some constants $M$ and $v_i$,

\[
P(|Y_1 + \cdots + Y_n| > x) \le 2\exp\{-x^2/(2(v + Mx))\},
\]

for $v \ge v_1 + \cdots + v_n$.

Lemma 4. (Bernstein's inequality, Lemma 2.2.9, van der Vaart and Wellner (1996)). For independent
random variables $Y_1, \dots, Y_n$ with bounded range $[-M, M]$ and mean zero,

\[
P(|Y_1 + \cdots + Y_n| > x) \le 2\exp\{-x^2/(2(v + Mx/3))\},
\]

for $v \ge \mathrm{var}(Y_1 + \cdots + Y_n)$.
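As a quick sanity check of the bounded-range Bernstein inequality (Lemma 4), the following sketch compares a Monte Carlo estimate of the tail probability with the bound. The choice of uniform variables on $[-M, M]$, the sample sizes and the function names are assumptions made purely for illustration.

```python
import math
import random


def bernstein_bound(x, v, M):
    """Right-hand side of Lemma 4: 2 * exp(-x^2 / (2 * (v + M * x / 3)))."""
    return 2.0 * math.exp(-x * x / (2.0 * (v + M * x / 3.0)))


def empirical_tail(n, M, x, reps, seed=0):
    """Monte Carlo estimate of P(|Y_1 + ... + Y_n| > x) for Y_i ~ Uniform[-M, M]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        s = sum(rng.uniform(-M, M) for _ in range(n))
        hits += abs(s) > x
    return hits / reps


n, M, x = 50, 1.0, 10.0
v = n * M * M / 3.0   # variance of a sum of n independent Uniform[-M, M] variables
tail = empirical_tail(n, M, x, reps=2000)
bound = bernstein_bound(x, v, M)
```

In this setting the estimated tail probability falls well below the bound, as the inequality requires; the bound is not tight, which is typical for Bernstein-type inequalities at moderate deviations.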

The following lemmas are needed for the proof of Theorem 1.

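Lemma 5. Suppose conditions (i) and (iii)-(vi) hold. For any $\delta > 0$, there exist some positive
constants $b_1$ and $b_2$ such that for $j = 1, \dots, p$, $k = 1, \dots, L_n$,

\[
P\Big( \Big| \frac{1}{n}\sum_{i=1}^{n} X_{ji} B_k(W_i) Y_i - E[X_j B_k(W) Y] \Big| \ge \frac{\delta}{n} \Big)
  \le 4\exp\Big\{ -\frac{\delta^2}{b_1 L_n^{-1} n + b_2 \delta} \Big\},
\]

and

\[
P\Big( \Big| \frac{1}{n}\sum_{i=1}^{n} B_k(W_i) Y_i - E[B_k(W) Y] \Big| \ge \frac{\delta}{n} \Big)
  \le 4\exp\Big\{ -\frac{\delta^2}{b_1 L_n^{-1} n + b_2 \delta} \Big\}.
\]

Proof. Recall $m(\mathbf{X}^*) = E(Y \mid \mathbf{X}, W)$. Let
$Z_{jki} = X_{ji} B_k(W_i) m(\mathbf{X}_i^*) - E[X_j B_k(W) m(\mathbf{X}^*)]$ and
$\zeta_{jki} = X_{ji} B_k(W_i) \varepsilon_i$. Then

\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n} X_{ji} B_k(W_i) Y_i - E[X_j B_k(W) Y]
 &= \frac{1}{n}\sum_{i=1}^{n} \big( X_{ji} B_k(W_i) m(\mathbf{X}_i^*) - E[X_j B_k(W) m(\mathbf{X}^*)] + X_{ji} B_k(W_i) \varepsilon_i \big) \\
 &= \frac{1}{n}\sum_{i=1}^{n} Z_{jki} + \frac{1}{n}\sum_{i=1}^{n} \zeta_{jki}.
\end{aligned}
\]

We first bound $\frac{1}{n}\sum_{i=1}^{n} Z_{jki}$. Note that for each $j$ and $k$, $\{Z_{jki}\}_{i=1}^{n}$ is a
sequence of independent random variables with mean zero. By condition (v), (37), and Lemmas 1 and 2, we
have for every $m \ge 2$, there exists a constant $K_4 > 0$ such that

\[
\begin{aligned}
E|Z_{jki}|^m &\le 2^m E|X_{ji} B_k(W_i) m(\mathbf{X}_i^*)|^m \\
 &\le 2^m E\big[ B_k^m(W_i)\, E[|X_{ji} m(\mathbf{X}_i^*)|^m \mid W_i] \big] \\
 &\le 2^m E\big[ B_k^2(W_i)\, e m K_4^m m! \big] \\
 &\le m!\,(2K_4)^{m-2} (8 e m K_4^2 C_2 L_n^{-1})/2,
\end{aligned} \tag{41}
\]

where the first inequality comes from the Minkowski inequality. Hence, it follows from Lemma 3 that for
any $\delta > 0$,

\[
P\Big( \Big| \frac{1}{n}\sum_{i=1}^{n} Z_{jki} \Big| \ge \frac{\delta}{2n} \Big)
  \le 2\exp\Big\{ -\frac{\delta^2}{64 e m K_4^2 C_2 L_n^{-1} n + 8 K_4 \delta} \Big\}. \tag{42}
\]

Next we bound $\frac{1}{n}\sum_{i=1}^{n} \zeta_{jki}$. Again, the $\zeta_{jki}$'s are centered independent
random variables. By conditions (v)-(vi), (37), and Lemmas 1 and 2, we have for every $m \ge 2$, there
exists a constant $K_5 > 0$ such that

\[
E|\zeta_{jki}|^m = E\big[ B_k^m(W_i)\, E[|X_{ji} \varepsilon_i|^m \mid W_i] \big]
  \le m!\,K_5^{m-2} (2 e m K_5^2 C_2 L_n^{-1})/2.
\]

Thus, according to Lemma 3,

\[
P\Big( \Big| \frac{1}{n}\sum_{i=1}^{n} \zeta_{jki} \Big| \ge \frac{\delta}{2n} \Big)
  \le 2\exp\Big\{ -\frac{\delta^2}{16 e m K_5^2 C_2 L_n^{-1} n + 4 K_5 \delta} \Big\}. \tag{43}
\]

Similarly, we can show that

\[
P\Big( \Big| \frac{1}{n}\sum_{i=1}^{n} B_k(W_i) m(\mathbf{X}_i^*) - E[B_k(W) m(\mathbf{X}^*)] \Big| \ge \frac{\delta}{2n} \Big)
  \le 2\exp\Big\{ -\frac{\delta^2}{64 e m K_2^2 C_2 L_n^{-1} n + 8 K_2 \delta} \Big\}, \tag{44}
\]

and

\[
P\Big( \Big| \frac{1}{n}\sum_{i=1}^{n} B_k(W_i) \varepsilon_i \Big| \ge \frac{\delta}{2n} \Big)
  \le 2\exp\Big\{ -\frac{\delta^2}{16 e m K_3^2 C_2 L_n^{-1} n + 4 K_3 \delta} \Big\}. \tag{45}
\]

Let $b_1 = 16 e m C_2 \max\{4K_4^2, K_5^2, 4K_2^2, K_3^2\}$ and $b_2 = \max(8K_4, 4K_5, 8K_2, 4K_3)$. Then
the combination of (42)-(45) by the union bound of probability yields the desired result. □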

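Lemma 6. Under conditions (i), (iii) and (v), there exist positive constants $C_3$ and $C_4$ such that
for $j = 1, \dots, p$,

\[
C_3 L_n^{-1} \le \lambda_{\min}(E[Q_j^T Q_j]) \le \lambda_{\max}(E[Q_j^T Q_j]) \le C_4 L_n^{-1}. \tag{46}
\]

Proof. Recall that $Q_j = (B, X_j B)$. For any
$\boldsymbol\eta = (\boldsymbol\eta_1^T, \boldsymbol\eta_2^T)^T \in \mathbb{R}^{2L_n}$ such that
$\|\boldsymbol\eta\|_2^2 = 1$,

\[
\boldsymbol\eta^T E[Q_j^T Q_j] \boldsymbol\eta
 = E\left[ (B\boldsymbol\eta_1,\ B\boldsymbol\eta_2)
   \begin{pmatrix} 1 & E[X_j \mid W] \\ E[X_j \mid W] & E[X_j^2 \mid W] \end{pmatrix}
   \begin{pmatrix} B\boldsymbol\eta_1 \\ B\boldsymbol\eta_2 \end{pmatrix} \right].
\]

Consider the eigenvalues $\lambda_1$ and $\lambda_2$ ($\lambda_1 \ge \lambda_2$) of the $2 \times 2$ middle
matrix on the right-hand side of the equation above. We have
$\lambda_1 + \lambda_2 = 1 + E[X_j^2 \mid W]$ (trace) and
$\lambda_1 \cdot \lambda_2 = \mathrm{Var}[X_j \mid W]$ (determinant). Therefore, by Lemma 1,

\[
\lambda_1 \le 1 + E[X_j^2 \mid W] \le 1 + 4eK_1^2,
\]

and by assumption (i),

\[
\lambda_2 \ge \frac{\mathrm{Var}[X_j \mid W]}{\lambda_1} \ge \frac{h_1}{1 + 4eK_1^2}.
\]

Using the above two bounds on the minimum and maximum eigenvalues, we have

\[
\frac{h_1}{1 + 4eK_1^2}\, E[(B\boldsymbol\eta_1)^2 + (B\boldsymbol\eta_2)^2]
 \le \boldsymbol\eta^T E[Q_j^T Q_j] \boldsymbol\eta
 \le (1 + 4eK_1^2)\, E[(B\boldsymbol\eta_1)^2 + (B\boldsymbol\eta_2)^2].
\]

By (38), we have

\[
\frac{h_1 C_1}{1 + 4eK_1^2} L_n^{-1} \le \boldsymbol\eta^T E[Q_j^T Q_j] \boldsymbol\eta \le (1 + 4eK_1^2) C_2 L_n^{-1}.
\]

Taking $C_3 = h_1 C_1/(1 + 4eK_1^2)$ and $C_4 = (1 + 4eK_1^2) C_2$, the result follows. □

Throughout the rest of the proof, for any matrix $A$, let $\|A\| = \sqrt{\lambda_{\max}(A^T A)}$ be the
operator norm and $\|A\|_\infty = \max_{i,j} |A_{ij}|$ be the infinity norm.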

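Lemma 7. Suppose conditions (i), (iii) and (v) hold. For any $\delta > 0$ and $j = 1, \dots, p$, there
exist some positive constants $b_3$ and $b_4$ such that

\[
P\Big( \Big\| \frac{1}{n} Q_{nj}^T Q_{nj} - E[Q_j^T Q_j] \Big\| \ge L_n \delta/n \Big)
  \le 6 L_n^2 \exp\Big\{ -\frac{\delta^2}{b_3 L_n^{-1} n + b_4 \delta} \Big\},
\]

and

\[
P\Big( \Big\| \frac{1}{n} B_n^T B_n - E[B^T B] \Big\| \ge L_n \delta/n \Big)
  \le 6 L_n^2 \exp\Big\{ -\frac{\delta^2}{b_3 L_n^{-1} n + b_4 \delta} \Big\}.
\]

In addition, for any given positive constant $b_5$, there exists some positive constant $b_6$ such that

\[
P\Big( \Big| \Big\| \Big(\frac{1}{n} Q_{nj}^T Q_{nj}\Big)^{-1} \Big\| - \big\| (E[Q_j^T Q_j])^{-1} \big\| \Big|
   \ge b_5 \big\| (E[Q_j^T Q_j])^{-1} \big\| \Big) \le 6 L_n^2 \exp\{-b_6 L_n^{-3} n\},
\]

and for any positive constant $b_7$, there exists some positive constant $b_8$ such that

\[
P\Big( \Big| \Big\| \Big(\frac{1}{n} B_n^T B_n\Big)^{-1} \Big\| - \big\| (E[B^T B])^{-1} \big\| \Big|
   \ge b_7 \big\| (E[B^T B])^{-1} \big\| \Big) \le 6 L_n^2 \exp\{-b_8 L_n^{-3} n\}.
\]

Proof. Observe that for $j = 1, \dots, p$,

\[
\frac{1}{n} Q_{nj}^T Q_{nj} - E[Q_j^T Q_j] = \begin{pmatrix} D_1 & D_{2j} \\ D_{2j} & D_{3j} \end{pmatrix},
\]

where $D_1 = \frac{1}{n}\sum_{i=1}^{n} B^T(W_i) B(W_i) - E[B^T B]$,
$D_{2j} = \frac{1}{n}\sum_{i=1}^{n} X_{ji} B^T(W_i) B(W_i) - E[X_j B^T B]$ and
$D_{3j} = \frac{1}{n}\sum_{i=1}^{n} X_{ji}^2 B^T(W_i) B(W_i) - E[X_j^2 B^T B]$. Then

\[
\Big\| \frac{1}{n} Q_{nj}^T Q_{nj} - E[Q_j^T Q_j] \Big\|
 \le 2L_n \Big\| \frac{1}{n} Q_{nj}^T Q_{nj} - E[Q_j^T Q_j] \Big\|_\infty
 = 2L_n \max(\|D_1\|_\infty, \|D_{2j}\|_\infty, \|D_{3j}\|_\infty).
\]

We first bound $\|D_1\|_\infty$. Recall that $0 \le B_k(\cdot) \le 1$ on $\mathcal{W}$, so

\[
|B_k(W_i) B_l(W_i) - E[B_k(W) B_l(W)]| \le 2
\]

for all $k$ and $l$. By (37),

\[
\mathrm{Var}\big( B_k(W_i) B_l(W_i) - E[B_k(W) B_l(W)] \big) \le E[B_k^2(W) B_l^2(W)] \le C_2 L_n^{-1}.
\]

By Lemma 4, we have

\[
P\Big( \Big| \frac{1}{n}\sum_{i=1}^{n} B_k(W_i) B_l(W_i) - E[B_k(W) B_l(W)] \Big| \ge \frac{\delta}{6n} \Big)
  \le 2\exp\{ -\delta^2/(72 C_2 L_n^{-1} n + 24\delta) \}.
\]

It then follows from the union bound of probability that

\[
P(\|D_1\|_\infty \ge \delta/(6n)) \le 2L_n^2 \exp\{ -\delta^2/(72 C_2 L_n^{-1} n + 24\delta) \}.
\]

We next bound $\|D_{2j}\|_\infty$. Note that for $k, l = 1, \dots, L_n$,

\[
\begin{aligned}
E\big[ |X_{ji} B_k(W_i) B_l(W_i) - E[X_j B_k(W) B_l(W)]|^m \big]
 &\le 2^m E\big[ |X_{ji} B_k(W_i) B_l(W_i)|^m \big] \\
 &= 2^m E\big[ E[|X_{ji}|^m \mid W_i]\, B_k^m(W_i) B_l^m(W_i) \big] \\
 &\le m!\,(2K_1)^{m-2} (8 e m K_1^2 C_2 L_n^{-1})/2,
\end{aligned}
\]

where Lemma 1 was used in the last inequality. By Lemma 3, we have

\[
P\Big( \Big| \frac{1}{n}\sum_{i=1}^{n} X_{ji} B_k(W_i) B_l(W_i) - E[X_j B_k(W) B_l(W)] \Big| \ge \frac{\delta}{6n} \Big)
  \le 2\exp\{ -\delta^2/(576 e m K_1^2 C_2 L_n^{-1} n + 24 K_1 \delta) \}.
\]

It then follows from the union bound of probability that

\[
P(\|D_{2j}\|_\infty \ge \delta/(6n)) \le 2L_n^2 \exp\{ -\delta^2/(576 e m K_1^2 C_2 L_n^{-1} n + 24 K_1 \delta) \}.
\]

Similarly, we can bound $\|D_{3j}\|_\infty$. For every $m \ge 2$ and $k, l = 1, \dots, L_n$, there exists a
constant $K_6 > 0$ such that

\[
\begin{aligned}
E\big[ |X_{ji}^2 B_k(W_i) B_l(W_i) - E[X_j^2 B_k(W) B_l(W)]|^m \big]
 &\le 2^m E\big[ E[|X_{ji}^2|^m \mid W_i]\, B_k^m(W_i) B_l^m(W_i) \big] \\
 &\le m!\,(2K_6)^{m-2} (8 e m K_6^2 C_2 L_n^{-1})/2.
\end{aligned}
\]

By Lemma 3, we have

\[
P\Big( \Big| \frac{1}{n}\sum_{i=1}^{n} X_{ji}^2 B_k(W_i) B_l(W_i) - E[X_j^2 B_k(W) B_l(W)] \Big| \ge \frac{\delta}{6n} \Big)
  \le 2\exp\{ -\delta^2/(576 e m K_6^2 C_2 L_n^{-1} n + 24 K_6 \delta) \}.
\]

It then follows from the union bound of probability that

\[
P(\|D_{3j}\|_\infty \ge \delta/(6n)) \le 2L_n^2 \exp\{ -\delta^2/(576 e m K_6^2 C_2 L_n^{-1} n + 24 K_6 \delta) \}. \tag{50}
\]

Let $b_3 = 72 C_2 \max\{1, 8 e m K_1^2, 8 e m K_6^2\}$ and $b_4 = 24 \max\{1, K_1, K_6\}$. Combining the
bounds on $\|D_1\|_\infty$, $\|D_{2j}\|_\infty$ and $\|D_{3j}\|_\infty$, we have

\[
P\Big( \Big\| \frac{1}{n} Q_{nj}^T Q_{nj} - E[Q_j^T Q_j] \Big\| \ge L_n \delta/n \Big)
  \le 6 L_n^2 \exp\Big\{ -\frac{\delta^2}{b_3 L_n^{-1} n + b_4 \delta} \Big\}. \tag{51}
\]

Observe that $\|\frac{1}{n} B_n^T B_n - E[B^T B]\| \le 2L_n \|D_1\|_\infty$. Thus, we have also proved that

\[
P\Big( \Big\| \frac{1}{n} B_n^T B_n - E[B^T B] \Big\| \ge L_n \delta/n \Big)
  \le 6 L_n^2 \exp\Big\{ -\frac{\delta^2}{b_3 L_n^{-1} n + b_4 \delta} \Big\}. \tag{52}
\]

We next prove the second part of the lemma. Note that for any symmetric matrices $A$ and $B$
(Fan, Feng and Song, 2011),

\[
|\lambda_{\min}(A) - \lambda_{\min}(B)| \le \max\{ |\lambda_{\min}(A - B)|,\ |\lambda_{\min}(B - A)| \}. \tag{53}
\]

It then follows from (53) that

\[
\Big| \lambda_{\min}\Big(\frac{1}{n} Q_{nj}^T Q_{nj}\Big) - \lambda_{\min}(E[Q_j^T Q_j]) \Big|
  \le 2L_n \Big\| \frac{1}{n} Q_{nj}^T Q_{nj} - E[Q_j^T Q_j] \Big\|_\infty,
\]

which implies that

\[
P\Big( \Big| \lambda_{\min}\Big(\frac{1}{n} Q_{nj}^T Q_{nj}\Big) - \lambda_{\min}(E[Q_j^T Q_j]) \Big| \ge L_n \delta/n \Big)
  \le 6 L_n^2 \exp\{ -\delta^2/(b_3 L_n^{-1} n + b_4 \delta) \}. \tag{54}
\]

Let $\delta = b_9 C_3 L_n^{-2} n$ in (54) for $b_9 \in (0, 1)$. According to (46), we have

\[
P\Big( \Big| \lambda_{\min}\Big(\frac{1}{n} Q_{nj}^T Q_{nj}\Big) - \lambda_{\min}(E[Q_j^T Q_j]) \Big|
   \ge b_9 \lambda_{\min}(E[Q_j^T Q_j]) \Big) \le 6 L_n^2 \exp\{-b_6 L_n^{-3} n\}, \tag{55}
\]

for some positive constant $b_6$. Next observe the fact that for $x, y > 0$, $a \in (0, 1)$ and
$b = 1/(1 - a) - 1$,

\[
|x^{-1} - y^{-1}| \ge b y^{-1} \quad \text{implies} \quad |x - y| \ge a y.
\]

This is because $x^{-1} - y^{-1} \ge b y^{-1}$ is equivalent to $x^{-1} \ge \frac{1}{1 - a} y^{-1}$, or
$x - y \le -a y$; on the other hand, $x^{-1} - y^{-1} \le -b y^{-1}$ implies
$x^{-1} \le (1 - \frac{a}{1 - a}) y^{-1} \le \frac{1}{1 + a} y^{-1}$ as $a \in (0, 1)$, and therefore
$x - y \ge a y$.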
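The elementary implication used above, namely that $|x^{-1} - y^{-1}| \ge b y^{-1}$ with $b = 1/(1 - a) - 1$ forces $|x - y| \ge a y$, can be checked on random inputs. The sketch below is only a numerical illustration; the sampling ranges and the function name are arbitrary choices.

```python
import random


def implication_holds(x, y, a):
    """Check: |1/x - 1/y| >= b/y with b = 1/(1 - a) - 1 implies |x - y| >= a*y."""
    b = 1.0 / (1.0 - a) - 1.0
    premise = abs(1.0 / x - 1.0 / y) >= b / y
    # A tiny relative slack guards against floating-point rounding at the boundary.
    conclusion = abs(x - y) >= a * y * (1.0 - 1e-9)
    return (not premise) or conclusion


rng = random.Random(0)
trials = [(rng.uniform(0.01, 10.0), rng.uniform(0.01, 10.0), rng.uniform(0.01, 0.99))
          for _ in range(10000)]
all_ok = all(implication_holds(x, y, a) for x, y, a in trials)
```

In the proof, $x$ and $y$ play the roles of the sample and population minimum eigenvalues, so a relative deviation of the inverse eigenvalue is converted into a relative deviation of the eigenvalue itself.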



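Then, letting $b_5 = 1/(1 - b_9) - 1$, it follows from (55) that

\[
P\Big( \Big| \Big(\lambda_{\min}\Big(\frac{1}{n} Q_{nj}^T Q_{nj}\Big)\Big)^{-1} - \big(\lambda_{\min}(E[Q_j^T Q_j])\big)^{-1} \Big|
   \ge b_5 \big(\lambda_{\min}(E[Q_j^T Q_j])\big)^{-1} \Big) \le 6 L_n^2 \exp\{-b_6 L_n^{-3} n\}. \tag{56}
\]

Following the same proof, by (38) we also have that for any positive constant $b_7$, there exists some
positive constant $b_8$ such that

\[
P\Big( \Big| \Big(\lambda_{\min}\Big(\frac{1}{n} B_n^T B_n\Big)\Big)^{-1} - \big(\lambda_{\min}(E[B^T B])\big)^{-1} \Big|
   \ge b_7 \big(\lambda_{\min}(E[B^T B])\big)^{-1} \Big) \le 6 L_n^2 \exp\{-b_8 L_n^{-3} n\}. \tag{57}
\]

The second part of the lemma then follows from the fact that for any symmetric positive definite matrix
$A$, $\lambda_{\min}(A)^{-1} = \lambda_{\max}(A^{-1})$. □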



A.3. Proof of Main Results 

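Proof of Proposition 1. Note that $E[Y \mid W, X_j] = a_j(W) + b_j(W) X_j$. By the spline approximation
result of Stone, there exist $a_j^*$ and $b_j^* \in S_n$ such that $\|a_j - a_j^*\|_\infty \le M_2 L_n^{-d}$
and $\|b_j - b_j^*\|_\infty \le M_2 L_n^{-d}$, where $S_n$ is the space of polynomial splines of degree
$l \ge 1$ with normalized B-spline basis $\{B_k, k = 1, \dots, L_n\}$, and $M_2$ is some positive constant.
Here $\|\cdot\|_\infty$ denotes the sup norm. Let $\boldsymbol\eta_j^*$ and $\boldsymbol\theta_j^*$ be
$L_n$-dimensional vectors such that $a_j^*(W) = B(W)\boldsymbol\eta_j^*$ and
$b_j^*(W) = B(W)\boldsymbol\theta_j^*$.

Recall that $\tilde a_j(W) = B(W)\tilde{\boldsymbol\eta}_j$ and $\tilde b_j(W) = B(W)\tilde{\boldsymbol\theta}_j$.
By the definition of $\tilde{\boldsymbol\eta}_j$ and $\tilde{\boldsymbol\theta}_j$, we have

\[
\begin{aligned}
(\tilde a_j, \tilde b_j) &= \arg\min_{a, b \in S_n} E[(Y - a(W) - b(W) X_j)^2] \\
 &= \arg\min_{a, b \in S_n} E[(E[Y \mid W, X_j] - a(W) - b(W) X_j)^2],
\end{aligned}
\]

and therefore
$\|E[Y \mid W, X_j] - \tilde a_j - \tilde b_j X_j\|^2 \le \|E[Y \mid W, X_j] - a_j^* - b_j^* X_j\|^2$. In other words,

\[
\begin{aligned}
\|a_j + b_j X_j - (\tilde a_j + \tilde b_j X_j)\|^2
 &\le \|(a_j^* + b_j^* X_j) - (a_j + b_j X_j)\|^2 \\
 &\le 2\|a_j - a_j^*\|^2 + 2\|(b_j - b_j^*) X_j\|^2 \\
 &\le 2 M_2^2 L_n^{-2d} (1 + E[X_j^2]).
\end{aligned}
\]

On the other hand, by the least-squares property,

\[
E[(Y - \tilde a_j - \tilde b_j X_j)(\tilde a_j + \tilde b_j X_j)] = 0,
\]

and by conditioning on $W$ and $X_j$, we have

\[
E[(Y - a_j - b_j X_j)(\tilde a_j + \tilde b_j X_j)] = 0.
\]

The last two equalities imply that

\[
E[(a_j + b_j X_j - \tilde a_j - \tilde b_j X_j)(\tilde a_j + \tilde b_j X_j)] = 0.
\]

Thus, by the Pythagorean theorem, we have

\[
\|a_j + b_j X_j\|^2 = \|\tilde a_j + \tilde b_j X_j\|^2 + \|a_j + b_j X_j - \tilde a_j - \tilde b_j X_j\|^2,
\]

and

\[
\|a_j + b_j X_j\|^2 - \|\tilde a_j + \tilde b_j X_j\|^2 \le 2 M_2^2 L_n^{-2d} (1 + E[X_j^2]). \tag{58}
\]

Similarly, we have

\[
\|a_0\|^2 - \|\tilde a_0\|^2 \le M_2^2 L_n^{-2d}. \tag{59}
\]

By taking $M_1 = M_2^2 (8eK_1^2 + 3)$ (cf. Lemma 1), the first part of Proposition 1 follows from (58) and
(59): writing $u_j^*$ for the marginal utility built from $a_j$, $b_j$ and $a_0$, and $u_j$ for its spline
counterpart built from $\tilde a_j$, $\tilde b_j$ and $\tilde a_0$,

\[
u_j^* - u_j = \|a_j + b_j X_j\|^2 - \|a_0\|^2 - \big( \|\tilde a_j + \tilde b_j X_j\|^2 - \|\tilde a_0\|^2 \big)
  \le M_1 L_n^{-2d}. \tag{60}
\]

By (60) and the assumed lower bound on $\min_{j \in \mathcal{M}_*} u_j^*$, we have

\[
\min_{j \in \mathcal{M}_*} u_j \ge c_1 L_n n^{-2\kappa}/h_2 - M_1 L_n^{-2d}.
\]

Then the desired result follows from $L_n^{-2d-1} \le c_1 (1/h_2 - \xi) n^{-2\kappa}/M_1$ for some
$\xi \in (0, 1/h_2)$. □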
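Proof of Theorem 1. We first prove part (i). Note that

\[
\hat u_{nj} - u_j = S_1 - S_2,
\]

where

\[
S_1 = \frac{1}{n} \mathbf{Y}^T Q_{nj} (Q_{nj}^T Q_{nj})^{-1} Q_{nj}^T \mathbf{Y} - E[Y Q_j] (E[Q_j^T Q_j])^{-1} E[Q_j^T Y],
\]

and

\[
S_2 = \frac{1}{n} \mathbf{Y}^T B_n (B_n^T B_n)^{-1} B_n^T \mathbf{Y} - E[Y B] (E[B^T B])^{-1} E[B^T Y].
\]

We first focus on $S_1$. Let $\mathbf{a}_n = \frac{1}{n} Q_{nj}^T \mathbf{Y}$, $\mathbf{a} = E[Q_j^T Y]$,
$U_n = (\frac{1}{n} Q_{nj}^T Q_{nj})^{-1}$ and $U = (E[Q_j^T Q_j])^{-1}$. Then

\[
\begin{aligned}
S_1 &= \mathbf{a}_n^T U_n \mathbf{a}_n - \mathbf{a}^T U \mathbf{a} \\
 &= (\mathbf{a}_n - \mathbf{a})^T U_n (\mathbf{a}_n - \mathbf{a}) + 2(\mathbf{a}_n - \mathbf{a})^T U_n \mathbf{a} + \mathbf{a}^T (U_n - U) \mathbf{a}.
\end{aligned}
\]

Denote the last three terms respectively by $S_{11}$, $S_{12}$, and $S_{13}$.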
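The three-term decomposition of $S_1$ is a purely algebraic identity that holds for any vectors and any symmetric matrices, and can be verified on a small example. The numbers below are arbitrary illustrative values, not quantities from the paper.

```python
def quad(u, M, v):
    """Bilinear form u^T M v for small dense lists."""
    return sum(u[i] * M[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


a_n, a = [1.3, -0.4], [1.0, -0.5]
U_n = [[2.0, 0.3], [0.3, 1.5]]   # symmetric, as U_n and U are in the proof
U = [[1.8, 0.2], [0.2, 1.6]]
d = [a_n[i] - a[i] for i in range(2)]                          # a_n - a
dU = [[U_n[i][j] - U[i][j] for j in range(2)] for i in range(2)]  # U_n - U

S1 = quad(a_n, U_n, a_n) - quad(a, U, a)
S11 = quad(d, U_n, d)          # (a_n - a)^T U_n (a_n - a)
S12 = 2.0 * quad(d, U_n, a)    # 2 (a_n - a)^T U_n a
S13 = quad(a, dU, a)           # a^T (U_n - U) a
```

Here `S1` agrees with `S11 + S12 + S13` up to rounding. The symmetry of $U_n$ is what allows the two cross terms to be merged into the single term $S_{12}$.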
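We first deal with $S_{11}$. Note that

\[
|S_{11}| \le \|U_n\| \cdot \|\mathbf{a}_n - \mathbf{a}\|_2^2. \tag{61}
\]

By Lemma 5 and the union bound of probability,

\[
P(\|\mathbf{a}_n - \mathbf{a}\|_2^2 \ge 2L_n \delta^2/n^2) \le 8L_n \exp\{ -\delta^2/(b_1 L_n^{-1} n + b_2 \delta) \}. \tag{62}
\]

According to the second part of Lemma 7, for any given positive constant $b_5$, there exists a positive
constant $b_6$ such that

\[
P\big( \big| \|U_n\| - \|U\| \big| \ge b_5 \|U\| \big) \le 6 L_n^2 \exp\{-b_6 L_n^{-3} n\}.
\]

Then it follows from (46) that

\[
P\big( \|U_n\| \ge (b_5 + 1) C_3^{-1} L_n \big) \le 6 L_n^2 \exp\{-b_6 L_n^{-3} n\}. \tag{63}
\]

Combining (61)-(63) and using the union bound of probability, we have

\[
P\big( |S_{11}| \ge 2(b_5 + 1) C_3^{-1} L_n^2 \delta^2/n^2 \big)
  \le 8L_n \exp\{ -\delta^2/(b_1 L_n^{-1} n + b_2 \delta) \} + 6 L_n^2 \exp\{-b_6 L_n^{-3} n\}. \tag{64}
\]

We next bound $S_{12}$. Note that

\[
|S_{12}| \le 2\|\mathbf{a}_n - \mathbf{a}\|_2 \cdot \|U_n\| \cdot \|\mathbf{a}\|_2. \tag{65}
\]

By Lemma 1,

\[
\begin{aligned}
\|\mathbf{a}\|_2^2 &= \|E[B^T Y]\|_2^2 + \|E[X_j B^T Y]\|_2^2 \\
 &= \sum_{k=1}^{L_n} (E[B_k(W) Y])^2 + \sum_{k=1}^{L_n} (E[X_j B_k(W) Y])^2 \\
 &\le 4 e C_2 (K_2^2 + K_4^2),
\end{aligned} \tag{66}
\]

where the calculation as in (41) was used.

It follows from (62), (63), (65), (66) and the union bound of probability that

\[
P\big( |S_{12}| \ge 2\sqrt{2}\,(b_5 + 1) C_3^{-1} \{4 e C_2 (K_2^2 + K_4^2)\}^{1/2} L_n^{3/2} \delta/n \big)
  \le 8L_n \exp\{ -\delta^2/(b_1 L_n^{-1} n + b_2 \delta) \} + 6 L_n^2 \exp\{-b_6 L_n^{-3} n\}. \tag{67}
\]

To bound $S_{13}$, note that

\[
|S_{13}| = |\mathbf{a}^T U_n (U^{-1} - U_n^{-1}) U \mathbf{a}|
  \le \|U_n\| \cdot \|U_n^{-1} - U^{-1}\| \cdot \|U\| \cdot \|\mathbf{a}\|_2^2. \tag{68}
\]

Then it follows from Lemma 6, Lemma 7, (63), (66), (68) and the union bound of probability that there
exist $b_3$, $b_4$ and $b_5$ such that

\[
P\big( |S_{13}| \ge 4 e C_2 (K_2^2 + K_4^2)(b_5 + 1)^2 C_3^{-2} L_n^3 \delta/n \big)
  \le 6 L_n^2 \exp\{ -\delta^2/(b_3 L_n^{-1} n + b_4 \delta) \} + 6 L_n^2 \exp\{-b_6 L_n^{-3} n\}. \tag{69}
\]

Hence, combining (64), (67) and (69), there exist some positive constants $s_1$, $s_2$ and $s_3$ such that

\[
\begin{aligned}
P\big( |S_1| \ge s_1 L_n^2 \delta^2/n^2 + s_2 L_n^{3/2} \delta/n + s_3 L_n^3 \delta/n \big)
 \le{}& 16 L_n \exp\{ -\delta^2/(b_1 L_n^{-1} n + b_2 \delta) \} + 6 L_n^2 \exp\{ -\delta^2/(b_3 L_n^{-1} n + b_4 \delta) \} \\
 & + 18 L_n^2 \exp\{-b_6 L_n^{-3} n\}.
\end{aligned} \tag{70}
\]

Similarly, we can prove that there exist positive constants $s_4$, $s_5$ and $s_6$ such that

\[
\begin{aligned}
P\big( |S_2| \ge s_4 L_n^2 \delta^2/n^2 + s_5 L_n^{3/2} \delta/n + s_6 L_n^3 \delta/n \big)
 \le{}& 8 L_n \exp\{ -\delta^2/(b_1 L_n^{-1} n + b_2 \delta) \} + 6 L_n^2 \exp\{ -\delta^2/(b_3 L_n^{-1} n + b_4 \delta) \} \\
 & + 18 L_n^2 \exp\{-b_8 L_n^{-3} n\}.
\end{aligned} \tag{71}
\]

Let $(s_1 + s_4) L_n^2 \delta^2/n^2 + (s_2 + s_5) L_n^{3/2} \delta/n + (s_3 + s_6) L_n^3 \delta/n = c_2 L_n n^{-2\kappa}$
for any given $c_2 > 0$ (e.g., take $\delta = c_2 L_n^{-2} n^{1 - 2\kappa}/(s_3 + s_6)$). There exist some
positive constants $c_3$ and $c_4$ such that

\[
P\big( |\hat u_{nj} - u_j| \ge c_2 L_n n^{-2\kappa} \big)
  \le (24 L_n + 12 L_n^2) \exp\{-c_3 n^{1 - 4\kappa} L_n^{-3}\} + 36 L_n^2 \exp\{-c_4 L_n^{-3} n\}. \tag{72}
\]

Then Theorem 1(i) follows from the union bound of probability.

We now prove part (ii). Note that on the event

\[
A_n = \Big\{ \max_{j \in \mathcal{M}_*} |\hat u_{nj} - u_j| \le c_1 \xi L_n n^{-2\kappa}/2 \Big\},
\]

by Proposition 1, we have

\[
\hat u_{nj} \ge c_1 \xi L_n n^{-2\kappa}/2, \quad \text{for all } j \in \mathcal{M}_*. \tag{73}
\]

Hence, by choosing $\tau_n = c_1 \xi L_n n^{-2\kappa}/2$, we have $\mathcal{M}_* \subset \hat{\mathcal{M}}_{\tau_n}$.
On the other hand, by the union bound of probability, there exist positive constants $c_6$ and $c_7$ such that

\[
P(A_n^c) \le s_n \big\{ (24 L_n + 12 L_n^2) \exp(-c_6 n^{1 - 4\kappa} L_n^{-3}) + 36 L_n^2 \exp(-c_7 L_n^{-3} n) \big\},
\]

and Theorem 1(ii) follows. □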
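Proof of Theorem 2. Let

\[
\boldsymbol\alpha = \arg\min_{\boldsymbol\alpha} E[(Y - Q\boldsymbol\alpha)^2],
\]

where $Q = (Q_1, \dots, Q_p)$ is a $2pL_n$-dimensional vector of functions. Then we have

\[
E[Q^T (Y - Q\boldsymbol\alpha)] = \mathbf{0}_{2pL_n},
\]

where $\mathbf{0}_{2pL_n}$ is a $2pL_n$-dimensional vector with all entries 0. This implies

\[
\|E[Q^T Y]\|_2^2 = \boldsymbol\alpha^T \Sigma^2 \boldsymbol\alpha \le \lambda_{\max}(\Sigma)\, \boldsymbol\alpha^T \Sigma \boldsymbol\alpha,
\]

recalling $\Sigma = E[Q^T Q]$. It follows from the orthogonal decomposition that
$\mathrm{Var}(Q\boldsymbol\alpha) \le \mathrm{Var}(Y)$ and $E[Q\boldsymbol\alpha] = E[Y]$ (recall the
inclusion of the intercept term). Therefore,

\[
\boldsymbol\alpha^T \Sigma \boldsymbol\alpha \le E[Y^2] = O(1),
\]

and

\[
\|E[Q^T Y]\|_2^2 = O(\lambda_{\max}(\Sigma)). \tag{74}
\]

Note that by the definition of $u_j$,

\[
\begin{aligned}
\sum_{j=1}^{p} u_j &= \sum_{j=1}^{p} \Big\{ E[Y Q_j] (E[Q_j^T Q_j])^{-1} E[Q_j^T Y] - E[Y B] (E[B^T B])^{-1} E[B^T Y] \Big\} \\
 &\le \max_{1 \le j \le p} \lambda_{\max}\{ (E[Q_j^T Q_j])^{-1} \} \sum_{j=1}^{p} \|E[Q_j^T Y]\|_2^2 \\
 &= \max_{1 \le j \le p} \lambda_{\max}\{ (E[Q_j^T Q_j])^{-1} \}\, \|E[Q^T Y]\|_2^2.
\end{aligned}
\]

By Lemma 6 and (74), the last term is of order $O(L_n \lambda_{\max}(\Sigma))$. This implies that the number
of $\{j : u_j > \delta L_n n^{-2\kappa}\}$ cannot exceed $O(n^{2\kappa} \lambda_{\max}(\Sigma))$ for any
$\delta > 0$. On the set

\[
B_n = \Big\{ \max_{1 \le j \le p} |\hat u_{nj} - u_j| \le \delta L_n n^{-2\kappa} \Big\},
\]

the number of $\{j : \hat u_{nj} > 2\delta L_n n^{-2\kappa}\}$ cannot exceed the number of
$\{j : u_j > \delta L_n n^{-2\kappa}\}$, which is bounded by $O(n^{2\kappa} \lambda_{\max}(\Sigma))$. By taking
$\delta = c_1 \xi/4$, we have

\[
P\big[ |\hat{\mathcal{M}}_{\tau_n}| \le O(n^{2\kappa} \lambda_{\max}(\Sigma)) \big] \ge P(B_n).
\]

Then the desired result follows from Theorem 1(i). □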